Versioning System for Modeling Environmental Data based on an Automatic Meta-Data Generation Strategy

22.12.2016 // SCADS

The Helmholtz Centre for Environmental Research (UFZ) is one of the world’s leading research centres in the field of Earth system science. The Department of Environmental Informatics of the UFZ develops software that simulates environmental phenomena via coupled thermal, hydrological, mechanical and chemical processes using innovative numerical methods. Examples include the prediction of groundwater contamination, the development of water management schemes and the simulation of innovative means of energy storage.

Modeling Process

The modeling process is a complete workflow that runs from data acquisition and integration through process simulation to the analysis and visualization of calculated results. Unfortunately, this process is neither transparent nor traceable, and it is often poorly documented. A typical model is developed over many weeks or months. Usually, a large number of revisions is necessary to update and refine the model so that the simulation is as exact as possible. The first setup of a model is often used to get an overview of the existing data. It also helps to detect potential problems in both the data and the numerical requirements. Further revisions try to solve these problems by adding data, refining or adjusting finite element meshes, or updating and adjusting processes and their parameterization.

Input and parameter files range from a few small files to hundreds of files containing detailed spatial, temporal or numerical information. Likewise, changes from one modeling step to the next may be small (e.g. one parameter value in a single input file) or major (e.g. a change in the geometrical input that requires a new discretization of the FEM domain as well as a new parameterization).

The current solution at the Department of Environmental Informatics is fully file-based, with files usually stored locally. As a result, the overview of a model's evolution is easily lost, and it is difficult to trace the order and nature of previous changes. Collaborative work on a single model is also difficult. A major deficiency of the current solution is that changes are not documented implicitly: each user keeps a log of such changes on their own laptop, at their own discretion.

Our Solution

Within ScaDS, we are developing a solution that overcomes these problems. It is based on a uniform, central and consistent storage of the individual revisions, so that each scientist

  • is able to view the simulation data they are entitled to
  • has a backup if the local data is lost or corrupted
  • has the possibility to automatically track, analyze and evaluate the changes in each modeling step.

We use KITDM (Karlsruhe Institute of Technology Data Manager) as the software architecture for building repositories for research data. It provides a modular and extensible framework that can be adapted to the needs of various scientific communities and use cases, and it relies on established standards and standard technologies. Its features include:

  • adjustable data storage and data organization
  • easy to use interfaces
  • high performance data transfer
  • a flexible role-based security model for easy sharing of data
  • flexible meta-data indexing and search functionality
  • seamless integration in MASI

Meta-data Management for Applied Sciences (MASI)

MASI is a current research project that aims to establish generic meta-data management for scientific data. Its research focuses on the generic description of meta-data, on generic backup and recovery strategies based on meta-data management, and on enhanced subsequent processing of data.

For the storage of environmental model revisions, we employ an efficient strategy that saves disk space. A parameter file is stored only if its content has actually changed in the latest revision (selective upload strategy). The model development is stored in a tree structure. Each node (revision) has a unique link to its predecessor, i.e. the revision from which it has been derived. Because of the tree structure, some nodes (revisions) may become irrelevant when a certain modeling approach has been rejected and development continues from a previous iteration. Meta-data for a modeling project are stored in a meta-data file, which is generated automatically during the development of a model and thus provides basic documentation. Examples of meta-data include:

  • name of the project
  • researcher in charge
  • modifications to a certain file in a selected revision
  • tree structure of all revisions
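
The selective upload strategy and the revision tree described above can be illustrated with a minimal Python sketch. It is not the KITDM-based implementation: the `Revision` class, the `commit` function, the use of SHA-256 content hashes to detect changes, and the file names are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


def digest(content: bytes) -> str:
    """Content hash used to decide whether a file has changed (assumed: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class Revision:
    """One node of the revision tree (hypothetical layout)."""
    rev_id: int
    parent: Optional["Revision"]                             # unique link to the predecessor
    stored: Dict[str, bytes] = field(default_factory=dict)   # files uploaded in this revision only
    hashes: Dict[str, str] = field(default_factory=dict)     # hashes of the complete model state


def commit(rev_id: int, parent: Optional[Revision], files: Dict[str, bytes]) -> Revision:
    """Create a revision that stores only files whose content differs from the parent."""
    known = parent.hashes if parent else {}
    hashes = {name: digest(data) for name, data in files.items()}
    stored = {name: files[name] for name, h in hashes.items() if known.get(name) != h}
    return Revision(rev_id, parent, stored, hashes)


# Hypothetical model files: a parameter file and a mesh.
root = commit(1, None, {"model.prj": b"k=1e-5", "mesh.vtu": b"<mesh v1>"})
rev2 = commit(2, root, {"model.prj": b"k=2e-5", "mesh.vtu": b"<mesh v1>"})
# If the approach in rev2 is rejected, it stays in the tree and work continues from root.
rev3 = commit(3, root, {"model.prj": b"k=1e-5", "mesh.vtu": b"<mesh v2>"})

print(sorted(root.stored))  # ['mesh.vtu', 'model.prj']
print(sorted(rev2.stored))  # ['model.prj']
print(sorted(rev3.stored))  # ['mesh.vtu']
```

In this sketch the first revision stores every file, while each later revision stores only the file that actually changed relative to its predecessor.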

Additional meta-data files can be generated as needed by the scientists developing the model. In this way, a preliminary documentation regarding the development of the model is created at the initial set-up of a model and complemented after each revision.
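
As a rough illustration of what such an automatically generated meta-data file might contain, the following Python sketch writes the items listed above into a JSON document. The JSON format, the field names, the `build_metadata` function and the example values are assumptions, not the format actually used in the project.

```python
import json
from typing import Dict, List, Optional


def build_metadata(project: str, researcher: str,
                   parents: Dict[int, Optional[int]],
                   changes: Dict[int, List[str]]) -> str:
    """Serialize project-level meta-data; all field names are illustrative."""
    revisions = [
        {
            "id": rev,
            "parent": parents[rev],                        # encodes the tree structure
            "modified_files": sorted(changes.get(rev, [])),
        }
        for rev in sorted(parents)
    ]
    record = {"project": project, "researcher": researcher, "revisions": revisions}
    return json.dumps(record, indent=2)


# Placeholder project with three revisions; revisions 2 and 3 both derive from 1.
doc = build_metadata(
    "example-project", "J. Doe",
    parents={1: None, 2: 1, 3: 1},
    changes={1: ["model.prj", "mesh.vtu"], 2: ["model.prj"], 3: ["mesh.vtu"]},
)
print(doc)
```

Regenerating this document after every revision would keep the documentation in step with the model without any manual effort by the scientist.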

The meta-data files can be evaluated and searched using a KITDM GUI or via Elasticsearch. The download strategy mirrors the selective upload: for a specific revision, the latest version of each relevant file is downloaded, so that the user always has a complete model available after download.
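
One way to assemble a complete model from selectively stored revisions is to walk from the requested revision back towards the root and keep the first copy of each file that is found. The Python sketch below shows this idea under stated assumptions: the `Store` layout, the `resolve` function and the file names are hypothetical, and file deletions are not handled.

```python
from typing import Dict, Optional, Tuple

# rev_id -> (parent_id, files uploaded in that revision); hypothetical layout
Store = Dict[int, Tuple[Optional[int], Dict[str, bytes]]]


def resolve(store: Store, rev_id: int) -> Dict[str, bytes]:
    """Collect the newest stored copy of every file visible from rev_id."""
    result: Dict[str, bytes] = {}
    current: Optional[int] = rev_id
    while current is not None:
        parent, files = store[current]
        for name, data in files.items():
            result.setdefault(name, data)  # copies from nearer revisions take precedence
        current = parent
    return result


store: Store = {
    1: (None, {"model.prj": b"k=1e-5", "mesh.vtu": b"<mesh v1>"}),
    2: (1, {"model.prj": b"k=2e-5"}),     # only the parameter file changed
    3: (1, {"mesh.vtu": b"<mesh v2>"}),   # separate branch: only the mesh changed
}

model = resolve(store, 2)
print(sorted(model))        # ['mesh.vtu', 'model.prj']
print(model["model.prj"])   # b'k=2e-5'
print(model["mesh.vtu"])    # b'<mesh v1>'
```

Revision 2 stores only the parameter file, yet the resolved download also contains the mesh inherited from revision 1; the change made in the separate branch (revision 3) does not leak into it.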


Karsten Rink:
