Version Control for Environmental Modelling Data
Background
The Department of Environmental Informatics of the UFZ develops software for simulating environmental phenomena via coupled thermal, hydrological, mechanical and chemical processes using innovative numerical methods. Examples include the prediction of groundwater contamination, the development of water management schemes and the simulation of innovative means of energy storage. The modelling process is a complete workflow, ranging from data acquisition and integration through process simulation to the analysis and visualization of calculated results.
This modelling process is neither transparent nor traceable, and it is often poorly documented. A typical model is developed over many weeks or months, and a large number of revisions is usually necessary to update and refine the model so that the simulation is as exact as possible. The first setup of a model is often used to get an overview of the existing data and to detect potential problems in both the data and the numerical requirements. Further revisions try to solve these problems by adding data, refining or adjusting finite element meshes, or updating and adjusting processes and their parametrization. Input data and parameter files range from a few small files up to hundreds of files containing detailed spatial, temporal or numerical information. Likewise, changes from one modelling step to the next may be small (e.g. one parameter value in a single input file) or major (e.g. the geometrical input changes and requires a new discretization of the FEM domain as well as a new parametrization).
Currently, all required files are stored locally on the laptop of each scientist. An overview of the evolution of a model is easily lost, and it is difficult to trace the order and nature of previous changes. Collaborative work on a single model is difficult. A major deficiency of the current solution is that changes are not documented implicitly: each user keeps a log of such changes on their own laptop at their own discretion.
Objectives
In cooperation with ScaDS, we are developing a solution that overcomes these impediments by introducing uniform, central and consistent storage of the individual revisions, such that each scientist
- has access to the simulation data they are entitled to
- has a backup if local data is lost or corrupted
- has the possibility to automatically track, analyse and evaluate the changes in each step of the modelling process
Preliminary results
A preliminary prototype of a version control system tailored to the requirements of environmental modelling has been developed. It employs KITDM (Karlsruhe Institute of Technology Data Manager) as the software architecture for building repositories for research data. KITDM provides a modular and extensible framework that can be adapted to the needs of various scientific communities and use cases, and it relies on established standards and standard technologies. It offers adjustable data storage and data organization, easy-to-use interfaces, high-performance data transfer, a flexible role-based security model for easy sharing of data, and flexible meta-data indexing and search functionality. KITDM is seamlessly integrated in MASI (Meta-data Management for Applied Sciences), a current research project that aims to establish generic meta-data management for scientific data. Research in MASI focuses on the generic description of meta-data, generic backup and recovery strategies based on meta-data management, and enhanced subsequent processing of data.
For the storage of environmental model revisions, we employ an efficient, disk-saving storage strategy: a parameter file is stored only if its content has actually changed in the latest revision (selective upload strategy). The model development is stored in a tree structure in which each node (revision) has a unique link to its predecessor, i.e. the revision from which it has been derived. Because of the tree structure, some nodes (revisions) may become irrelevant when a certain modelling approach has been rejected and development continues from a previous iteration. Meta-data for a modelling project (e.g. the name of the project, the researcher in charge, the modifications to a certain file in a selected revision, or the tree structure of all revisions) are stored in a meta-data file. This file thus contains basic documentation that is generated automatically during the development of a model. Additional meta-data files can be generated as needed by the scientists developing the model. In this way, preliminary documentation of the development of the model is created at the initial set-up of a model and complemented after each revision.
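The selective upload and the predecessor links can be illustrated with a minimal Python sketch. It is not the prototype's implementation: the `Revision` class, the `commit` function and the file names are hypothetical, and detecting changes by comparing SHA-256 hashes is an assumption, since the text does not say how changed content is recognized.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


def content_hash(data: bytes) -> str:
    """Fingerprint of a file's content (assumed change-detection method)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Revision:
    """One node of the revision tree (hypothetical structure)."""
    rev_id: str
    parent: Optional["Revision"]  # unique link to the predecessor, None for the root
    stored_files: Dict[str, bytes] = field(default_factory=dict)  # only changed files
    hashes: Dict[str, str] = field(default_factory=dict)  # hashes of the complete model


def commit(rev_id: str, parent: Optional[Revision],
           working_copy: Dict[str, bytes]) -> Revision:
    """Create a revision that stores only files whose content differs from the parent."""
    parent_hashes = parent.hashes if parent is not None else {}
    hashes = {name: content_hash(data) for name, data in working_copy.items()}
    stored = {
        name: data
        for name, data in working_copy.items()
        if parent_hashes.get(name) != hashes[name]  # new or modified file
    }
    return Revision(rev_id, parent, stored, hashes)


# Hypothetical example: r2 and r3 both derive from r1, forming a small tree.
r1 = commit("r1", None, {"mesh.vtu": b"mesh v1", "params.prj": b"k=1"})
r2 = commit("r2", r1, {"mesh.vtu": b"mesh v1", "params.prj": b"k=2"})
r3 = commit("r3", r1, {"mesh.vtu": b"mesh v2", "params.prj": b"k=1"})
```

In this example, `r2` stores only the parameter file and `r3` stores only the mesh, while the root revision `r1` stores both files. If the approach in `r2` were rejected, development could continue from `r1` via `r3`, and `r2` would simply remain as an unused branch of the tree.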
The download strategy mirrors the selective upload: for a specific revision, the latest version of every relevant file is downloaded, so that the user always has a complete model available after the download.
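One way to picture this download step is to walk from the requested revision back towards the root and keep the first (i.e. most recent) copy of each file that is found. The sketch below assumes this interpretation; the `checkout` function, the in-memory `repo` layout and the file names are hypothetical, and file deletions are deliberately not handled.

```python
from typing import Dict, Optional, Tuple

# Hypothetical repository layout:
# revision id -> (predecessor id or None, files stored in that revision)
Repo = Dict[str, Tuple[Optional[str], Dict[str, bytes]]]


def checkout(repo: Repo, rev_id: str) -> Dict[str, bytes]:
    """Assemble the complete model for rev_id from selectively stored files."""
    model: Dict[str, bytes] = {}
    current: Optional[str] = rev_id
    while current is not None:
        parent_id, stored = repo[current]
        for name, data in stored.items():
            # Revisions are visited newest first, so a file already present
            # in the model is more recent than the copy found here.
            model.setdefault(name, data)
        current = parent_id
    return model


# Hypothetical example: r2 changed only the parameters, r3 only the mesh.
repo: Repo = {
    "r1": (None, {"mesh.vtu": b"mesh v1", "params.prj": b"k=1"}),
    "r2": ("r1", {"params.prj": b"k=2"}),
    "r3": ("r1", {"mesh.vtu": b"mesh v2"}),
}
model_r2 = checkout(repo, "r2")
model_r3 = checkout(repo, "r3")
```

Here `model_r2` combines the mesh from `r1` with the parameters from `r2`, and `model_r3` combines the mesh from `r3` with the parameters from `r1`, so both downloads yield a complete model even though neither revision stores all files itself.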
ScaDS project members:
Dr. Martin Zinner, Dr. Karsten Rink, Dr. Thomas Fischer, Dr. René Jäkel
Publication
Zinner M, Rink K, Jäkel R, Feldhoff K, Grunzke R, Fischer T, Song R, Walther M, Jejkal T, Kolditz O, Nagel W E: Automatic Documentation of the Development of Numerical Models for Scientific Applications using Specific Revision Control. In: Proc. of the Twelfth International Conference on Software Engineering Advances, 2017 (accepted).