Versioning system for modeling environmental data based on an automatic meta-data generation strategy

12/22/2016 // SCADS


The Helmholtz Centre for Environmental Research (UFZ) is one of the world’s leading research centres in the field of Earth system science. The Department of Environmental Informatics of the UFZ develops software for simulating environmental phenomena via coupled thermal, hydrological, mechanical and chemical processes using innovative numerical methods. Examples include the prediction of groundwater contamination, the development of water management schemes and the simulation of innovative means of energy storage. The modeling process is a complete workflow, ranging from data acquisition and integration through process simulation to the analysis and visualization of the calculated results.

Unfortunately, this modeling process is often neither transparent nor traceable, and it is frequently poorly documented. A typical model is developed over many weeks or months, and a large number of revisions are usually necessary to update and refine the model so that the simulation is as exact as possible. The first setup of a model is often used to get an overview of the existing data and to detect potential problems in both the data and the numerical requirements. Further revisions try to solve these problems by adding data, refining or adjusting finite element meshes, or updating and adjusting processes and their parameterization. The input and parameter files range from a few small files to hundreds of files containing detailed spatial, temporal or numerical information. Likewise, changes from one modeling step to the next may be small (e.g. one parameter value in a single input file) or major (e.g. a change in the geometrical input that requires a new discretization of the FEM domain as well as a new parameterization).

The current solution at the Department of Environmental Informatics is entirely file based, with the files usually stored locally on each scientist's laptop. As a result, it is easy to lose track of how a model has evolved, and difficult to trace the order and nature of previous changes. Collaborative work on a single model is also difficult. A major deficiency of the current solution is that changes are not documented implicitly: each user keeps a log of such changes on their own laptop, at their own discretion.

Within ScaDS, we are developing a solution that overcomes these obstacles by introducing uniform, central and consistent storage of the individual revisions, such that each scientist
* is able to view the simulation data they are entitled to
* has a backup if the local data is lost or corrupted
* has the possibility to automatically track, analyze and evaluate the changes in each modeling step.

We use KITDM (Karlsruhe Institute of Technology Data Manager) as the software architecture for building repositories for research data. It provides a modular and extensible framework that can be adapted to the needs of various scientific communities and use cases, and it employs established standards and standard technologies. It offers adjustable data storage and data organization, easy-to-use interfaces, high-performance data transfer, a flexible role-based security model for easy sharing of data, and flexible meta-data indexing and search functionality. It is seamlessly integrated in MASI (Meta-data Management for Applied Sciences), a current research project that aims to establish generic meta-data management for scientific data. The research in MASI focuses on the generic description of meta-data, generic backup and recovery strategies based on meta-data management, and enhanced subsequent processing of data. The main partners within the MASI project are TU Dresden – ZIH (Center for Information Services and High Performance Computing) and the Karlsruhe Institute of Technology.

For the storage of environmental model revisions, we employ an efficient and “disk-saving” storage strategy: a specific parameter file is stored only if its content has actually changed in the latest revision (selective upload strategy). The model development is stored in a tree structure in which each node (revision) has a unique link to its predecessor, i.e. the revision from which it was derived. Because of the tree structure, some nodes (revisions) may become irrelevant if a certain modeling approach has been rejected and development continues from a previous iteration. Meta-data for a modeling project (e.g. the name of the project, the researcher in charge, the modifications to a certain file in a selected revision, or the tree structure of all revisions) are stored in a meta-data file and thus provide basic documentation that is generated automatically during the development of a model. Additional meta-data files can be generated as needed by the scientists developing the model. In this way, preliminary documentation of the model's development is created at the initial set-up of a model and complemented after each revision.
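
The selective upload strategy and the automatically generated meta-data can be illustrated with a minimal in-memory sketch. This is not the KITDM-based implementation: the class `RevisionTree`, its methods, the use of SHA-256 digests for change detection and the file names in the usage note are all assumptions made for illustration, and deleted files are not handled.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Optional


def digest(content: bytes) -> str:
    """Content hash used here to decide whether a file has changed (an assumption)."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class Revision:
    rev_id: int
    parent: Optional[int]   # unique link to the predecessor, None for the root
    stored: Dict[str, str]  # file name -> digest, only files uploaded in this revision


class RevisionTree:
    """Hypothetical stand-in for the central repository."""

    def __init__(self, project: str, researcher: str) -> None:
        self.project = project
        self.researcher = researcher
        self.revisions: Dict[int, Revision] = {}
        self.blobs: Dict[str, bytes] = {}  # digest -> file content

    def snapshot(self, rev_id: Optional[int]) -> Dict[str, str]:
        """Return file name -> digest for the complete model at a revision."""
        chain = []
        while rev_id is not None:
            rev = self.revisions[rev_id]
            chain.append(rev)
            rev_id = rev.parent
        files: Dict[str, str] = {}
        for rev in reversed(chain):  # apply root first, newest last
            files.update(rev.stored)
        return files

    def commit(self, parent: Optional[int], files: Dict[str, bytes]) -> int:
        """Store only the files whose content differs from the parent revision."""
        previous = self.snapshot(parent)
        stored: Dict[str, str] = {}
        for name, content in files.items():
            d = digest(content)
            if previous.get(name) != d:
                self.blobs[d] = content
                stored[name] = d
        rev_id = len(self.revisions) + 1
        self.revisions[rev_id] = Revision(rev_id, parent, stored)
        return rev_id

    def metadata(self) -> str:
        """Generate the basic project documentation as a JSON document."""
        return json.dumps(
            {
                "project": self.project,
                "researcher": self.researcher,
                "revisions": {
                    str(r.rev_id): {"parent": r.parent, "modified": sorted(r.stored)}
                    for r in self.revisions.values()
                },
            },
            indent=2,
        )
```

With such a structure, committing a root revision containing two files (say, a hypothetical `mesh.vtu` and `params.prj`) and then a child revision in which only `params.prj` differs would store a single file for the child, while the generated meta-data still records the parent link and the list of modified files for every revision.
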
The meta-data files can be evaluated and searched using a KITDM GUI or via Elasticsearch. The download strategy mirrors the “selective upload”: for a specific revision, the latest version of every relevant file is downloaded, so that the user always has a complete model available after the download.
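
One way to picture this download strategy is to walk from the requested revision back to the root of the tree and keep, for each file, the copy from the nearest revision that stored it. The sketch below works under that assumption; the function `resolve_revision`, the dictionary layout and the file names are hypothetical and do not reflect the KITDM interface.

```python
from typing import Dict, Optional, TypedDict


class Revision(TypedDict):
    parent: Optional[int]    # predecessor revision, None for the root
    files: Dict[str, bytes]  # only the files uploaded in this revision


def resolve_revision(revisions: Dict[int, Revision], rev_id: int) -> Dict[str, bytes]:
    """Assemble the complete model for rev_id from selectively stored files."""
    resolved: Dict[str, bytes] = {}
    current: Optional[int] = rev_id
    while current is not None:
        rev = revisions[current]
        for name, content in rev["files"].items():
            # The first hit on the way to the root is the latest version.
            resolved.setdefault(name, content)
        current = rev["parent"]
    return resolved


# Hypothetical tree: revisions 2 and 3 both derive from revision 1.
revisions: Dict[int, Revision] = {
    1: {"parent": None, "files": {"mesh.vtu": b"m1", "params.prj": b"p1"}},
    2: {"parent": 1, "files": {"params.prj": b"p2"}},
    3: {"parent": 1, "files": {"mesh.vtu": b"m2"}},
}

model = resolve_revision(revisions, 2)
```

Here `model` contains the mesh from revision 1 and the parameter file from revision 2, so the downloaded model is complete even though revision 2 itself stored only one file. Revision 3, which belongs to a different branch, has no influence on the result.
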


Martin Zinner:
Karsten Rink: