Earth System Data Cubes

Multivariate Earth System Data Cubes (Figure by Maximilian Söchting, RSC4Earth, Leipzig University)

The concept of the Earth System Data Cube (Mahecha et al. 2020) has rapidly become a popular tool in Earth System Sciences in recent years, as it greatly facilitates data visualization and (interoperable) data handling, including preprocessing and statistical analyses. The original data sets are transformed in space and time to fit the common grid of the Data Cube, which has three dimensions (longitude, latitude and time) and holds a set of variables that are mapped onto this spatio-temporal grid. Data Cubes are typically chunked, meaning they consist of a set of smaller cubes (chunks) that together form what we call the Earth System Data Cube (ESDC). The ESDC concept allows multiple remotely sensed spatio-temporal data streams to be treated as a single one and thereby makes it possible to work with a wide range of data in a uniform way.
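To make the chunking idea concrete, the following minimal sketch splits a cube with dimensions (time, latitude, longitude) into index ranges, one per chunk. The cube shape, the chunk shape and the helper name `chunk_slices` are illustrative assumptions, not the layout of any actual ESDC.

```python
from itertools import product


def chunk_slices(shape, chunk_shape):
    """Return one tuple of slices per chunk of a (time, lat, lon) cube."""
    per_dimension = [
        # Cut each dimension into consecutive ranges of at most `step` cells.
        [slice(start, min(start + step, size)) for start in range(0, size, step)]
        for size, step in zip(shape, chunk_shape)
    ]
    # Every combination of per-dimension ranges is one chunk.
    return list(product(*per_dimension))


# Hypothetical cube: 46 time steps, 180 latitudes, 360 longitudes.
cube_shape = (46, 180, 360)
# Hypothetical chunking: full time axis, 90 x 90 cells in space.
chunk_shape = (46, 90, 90)

chunks = chunk_slices(cube_shape, chunk_shape)
print(len(chunks))  # 1 * 2 * 4 = 8 chunks
```

Each tuple of slices addresses one smaller cube, and together the tuples cover the full grid exactly once.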

A parallel development is the growing need to apply Machine Learning methods to Earth System Sciences data: most parts of the Earth system are continuously monitored by sensors, and Machine Learning can cope with both the volume of the data and its heterogeneous characteristics. Ideally, classical operations on the ESDC could be extended by Machine Learning applications in a way that preserves interoperability. However, there is a conflict between the nature of remotely sensed data, the structure of the ESDC and the requirements for meaningful Machine Learning applications, which needs to be addressed:

  1. Sampling the Earth naturally leads to an uneven distribution of data points as a result of its spherical shape. This effect is reinforced by data gaps due to, e.g., satellite trajectories or cloud cover. Hence, data are not uniformly distributed across the chunks of the ESDC.
  2. Remotely sensed data tend to be auto-correlated within (neighboring) chunks, as data points in close spatio-temporal vicinity naturally show low variance.
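One common way to compensate for the first issue is to weight samples by grid-cell area. The sketch below assumes a regular longitude-latitude grid, on which cell area is proportional to the cosine of the latitude, and gives cells without data zero probability. The function names and the `valid` argument are hypothetical, and this is an illustration rather than the sampling strategy developed in this project.

```python
import math
import random


def latitude_weights(lat_centers_deg):
    """Relative cell area on a regular lon/lat grid (assumed): cos(latitude)."""
    return [math.cos(math.radians(lat)) for lat in lat_centers_deg]


def sample_cells(lat_centers_deg, n_lon, n_samples, valid=None, seed=0):
    """Draw (lat_index, lon_index) pairs with probability proportional to area.

    `valid` is an optional set of (lat_index, lon_index) pairs that hold data;
    cells outside it (data gaps) are never drawn. Sampling is with replacement.
    """
    rng = random.Random(seed)
    cells, weights = [], []
    for i, weight in enumerate(latitude_weights(lat_centers_deg)):
        for j in range(n_lon):
            if valid is None or (i, j) in valid:
                cells.append((i, j))
                weights.append(weight)
    return rng.choices(cells, weights=weights, k=n_samples)


# Hypothetical 1-degree grid: cell centers at -89.5, -88.5, ..., 89.5.
lat_centers = [-89.5 + i for i in range(180)]
samples = sample_cells(lat_centers, n_lon=360, n_samples=1000)
```

With these weights, a cell at 60° latitude is drawn about half as often as a cell at the equator, which matches its smaller area.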

Therefore, it is essential to enable Machine Learning that respects the basic principles of geo-data and goes well beyond naive applications of Machine Learning in the Earth system context. We focus on developing sophisticated and efficient sampling strategies for Data Cubes and ML tools that can operate on these large cloud-hosted data sets.
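As a simple illustration of a sampling strategy that takes auto-correlation into account, the sketch below assigns whole chunks, rather than individual data points, to the training or the test set, so that highly similar points from the same chunk cannot end up on both sides of the split. The chunk layout, the helper name `split_chunks` and the test fraction are assumptions made for this example. Because neighboring chunks can still be correlated, a realistic setup would probably also need a buffer between training and test chunks.

```python
import random


def split_chunks(chunk_ids, test_fraction=0.2, seed=0):
    """Randomly assign whole chunks to train or test (block-wise split)."""
    rng = random.Random(seed)
    shuffled = list(chunk_ids)
    rng.shuffle(shuffled)
    # Keep at least one chunk for testing.
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical spatial chunk grid: 4 latitude blocks x 8 longitude blocks.
chunk_ids = [(lat_block, lon_block) for lat_block in range(4) for lon_block in range(8)]
train_chunks, test_chunks = split_chunks(chunk_ids)
```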

More to it: