Research and development work along the value chain of materials, construction, simulation, production, and the finished component in operation generates a substantial amount of structured and unstructured data in each area (see Figure 1). This includes simulation data, construction data, machine and production logs, and sensor data. The individual data records vary greatly in size (kilobytes to terabytes), are heterogeneous, and follow one another chronologically along the value chain. In addition, new knowledge is generated in the course of these activities, which is currently available mostly in isolated form within the context of the respective research activity.
Based on this initial situation, ScaDS Dresden/Leipzig will advance the development and application of Big Data methods along the value chain and exploit the potential within it. From the point of view of the materials and engineering sciences, the main objectives are:
Based on a demonstrator vehicle (FiF of the SFB639), browser-based software for the cost- and material-efficient design of lightweight structures is being developed together with the Chair of Computer Graphics and Visualization, TU Dresden. For the first time, this allows simulation data to be visualized across several scales, from the filament through the roving to the multilayer composite. The software is platform-independent: it runs on different operating systems and does not have to be installed, so calculation results can be presented quickly (sketched below).
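The interfaces of this software are not documented here; the following is only a minimal, hypothetical sketch of how a browser front end could request precomputed results for one of the scales from a small web service. Flask, the endpoint path and the placeholder result values are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch: serve precomputed multi-scale simulation results to a browser client.
# Endpoint names, the scale identifiers and the stored values are illustrative assumptions.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Hypothetical precomputed result summaries, keyed by scale.
RESULTS = {
    "filament": {"nodes": 120_000, "max_stress_MPa": 310.5},
    "roving": {"nodes": 40_000, "max_stress_MPa": 289.2},
    "composite": {"nodes": 8_000, "max_stress_MPa": 245.7},
}

@app.route("/results/<scale>")
def get_results(scale: str):
    """Return the result summary for one scale so the browser can render it."""
    if scale not in RESULTS:
        abort(404, description=f"Unknown scale: {scale}")
    return jsonify(RESULTS[scale])

if __name__ == "__main__":
    app.run(port=5000)
```

A browser client would then fetch, for example, /results/roving and render the returned summary.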
For the High Performance Computing (HPC) clusters Taurus and Venus, user-friendly starter scripts for running the finite element programs Abaqus and LS-Dyna on HPC resources with a SLURM environment are being developed together with the ZIH, TU Dresden. The scripts considerably reduce the effort for the user, who only needs to specify basic parameters such as the number of nodes and CPUs to use; the computing environment on the HPC resource is defined and managed automatically (a minimal sketch of such a starter script follows below). In future work, the degree of parallelization is to be increased further and a graphical user interface is to be created. The work is carried out by the Institute of Lightweight Engineering and Polymer Technology (Prof. Gude), TU Dresden.
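Such a starter script can be illustrated with a minimal, hypothetical sketch in Python: it writes a SLURM batch file from a few basic user parameters and submits it with sbatch. The module name, the walltime default and the Abaqus command line shown here are site-specific assumptions, not the actual ZIH scripts.

```python
#!/usr/bin/env python3
"""Minimal sketch of a SLURM starter script generator for an Abaqus job.

Not the actual ZIH/ILK scripts: module names and solver invocation are
site-specific assumptions and must be adapted to the cluster.
"""
import subprocess
from pathlib import Path

def write_batch_script(job_name: str, input_file: str, nodes: int, cpus: int,
                       walltime: str = "08:00:00") -> Path:
    """Write a SLURM batch file from a few basic user parameters."""
    script = f"""#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node={cpus}
#SBATCH --time={walltime}

# Load the solver environment (module name is an assumption, adapt to the cluster).
module load abaqus

# Run the solver on the total number of requested CPUs.
abaqus job={job_name} input={input_file} cpus={nodes * cpus} interactive
"""
    path = Path(f"{job_name}.sbatch")
    path.write_text(script)
    return path

if __name__ == "__main__":
    batch = write_batch_script("example_job", "example_job.inp", nodes=2, cpus=16)
    subprocess.run(["sbatch", str(batch)], check=True)  # submit to SLURM
```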
Analyses of human-environment systems are characterized by spatially high-resolution and temporally dynamic data acquisition and modeling. The spectrum ranges from the simulation of regional climate systems with subsequent impact analyses using coupled models to the sensor-based control of vehicles with telemetry and video data. Spatial and time-series analyses, prospective ensemble calculations and real-time simulations are examples of exceptionally data- and compute-intensive applications.
At present, coupled models are often used for these applications: the model data are typically prepared and managed case by case, with individual routines and specific data storage variants. This considerably limits the coupling of models and the integration of measured and calculated data, so that real complexity and dynamics can only be represented to a restricted extent. In addition, the computational processes are extremely time-intensive due to these isolated solutions and the associated processing overhead, which narrowly limits the depiction of uncertainties and of the effects of alternative system interventions.
At the same time, the increasing networking of different modes of transport in passenger traffic is leading more and more often to the establishment of local and regional data hubs and marketplaces for the data-intensive and time-critical linking of traffic data for information services. In the area of traffic technology, vehicles increasingly act as transmitters and receivers of environmental signals (e.g. GPS, WLAN, GSM, DAB) and as communication units interacting with other vehicles and infrastructures (Car-2-X communication). The requirements for corresponding software systems for data evaluation and reduction therefore continue to grow. For implementing and testing the methods of the methodological sciences of the competence center ScaDS Dresden/Leipzig, the focus is on the following three sub-topics, with cross-references between them explicitly included:
Participating Working Groups:
The exponential growth of internet-based communication means that the humanities and social sciences have access to a large amount of data for data-driven analysis. Large-scale digitization programs and the release of official data have also made historical and statistical sources retrievable. The particular challenge in the digital humanities is the linking and interplay of quantitative, data-driven analysis with qualitative interpretation, so that questions of knowledge extraction are particularly important. Processing very large data volumes, complex data structures and fast changes requires Big Data methods, which are already more established in other disciplines. At the same time, psychologists and social scientists can use their methodological repertoire to contribute to a critical reflection on how Big Data is handled in science and business.
In the area of digital humanities, the traditional separation of resources and the metadata describing them increasingly leads to a separation of the resources themselves into their individual components. In the case of text collections, this concerns, for example, the actual raw texts, various annotations and their metadata. Specifically, this requires a multi-tiered architecture: a storage solution that holds the full texts as well as the annotations, various indices (over, among other things, annotations and metadata), and suitable interfaces for providing the data using the existing indices (illustrated by the sketch after this paragraph). For the preparation and annotation of the data, it must be noted that various tools serve a similar purpose (such as crawling, cleaning, segmentation, tagging or parsing procedures), but each works efficiently only for specific domains. In addition, the quality of the processing chains used depends directly on the pre-processing steps and the appropriate parameterization of the individual processes. The resulting large number of different versions of a resource generated from the same raw data leads to a significantly increased demand for storage space and computing capacity for the systematic evaluation and efficient provision of these data.
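The following minimal sketch, written in Python purely for illustration, shows one way to keep raw texts, annotation layers and metadata separate while making each generated version identifiable by the tool and parameters that produced it. The class names and fields are assumptions, not the project's actual data model.

```python
# Minimal sketch: raw text, annotation layers and metadata kept apart but addressable
# together; each layer version is identified by the producing tool and its parameters.
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class AnnotationLayer:
    """One annotation layer (e.g. tokens, POS tags) plus the tool and parameters that produced it."""
    layer_name: str
    tool: str
    parameters: dict
    annotations: list  # e.g. [(start_offset, end_offset, label), ...]

    def version_id(self) -> str:
        """Stable identifier derived from tool and parameters, so re-runs with different
        pre-processing settings yield distinguishable versions of the same layer."""
        key = json.dumps({"tool": self.tool, "parameters": self.parameters}, sort_keys=True)
        return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

@dataclass
class TextResource:
    """A raw text with its metadata and any number of annotation layers."""
    resource_id: str
    raw_text: str
    metadata: dict
    layers: list = field(default_factory=list)

# Usage: the same raw text can carry several versions of the same layer type.
doc = TextResource("doc-0001", "Example sentence.", {"language": "en", "source": "crawl"})
doc.layers.append(AnnotationLayer("tokens", "toolA", {"model": "default"},
                                  [(0, 7, "TOKEN"), (8, 17, "TOKEN")]))
print(doc.layers[0].version_id())
```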
The work is carried out by the Natural Language Processing group of the University of Leipzig. A more detailed overview of the available data and tools, including ready-to-use demos, can be found on the project website.
The Business Data division researches IT systems that support cross-company value creation systems and their transformation. Within Big Data research, we focus on fast data evaluation in real time (sketched below) and on the design of intelligent (smart) applications for data-driven business processes.
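The real-time evaluation mentioned above can be illustrated with a minimal, hypothetical sketch: events are enriched with background information and checked against a simple pattern within a sliding time window. The event fields, the pattern and the in-memory lookup are assumptions for illustration only, not the project's actual rules or infrastructure.

```python
# Minimal sketch of a complex event processing (CEP) style check: events are enriched
# with background data and matched against a simple pattern within a time window.
from collections import deque

CUSTOMER_INFO = {"c42": {"segment": "premium"}}  # hypothetical background data

def enrich(event: dict) -> dict:
    """Merge background information from a (here: in-memory) company data source."""
    return {**event, **CUSTOMER_INFO.get(event.get("customer_id"), {})}

def detect_pattern(events, window_seconds=60):
    """Yield an alert when an 'order' is followed by a 'payment_failed'
    for the same customer within the time window."""
    recent_orders = deque()  # (timestamp, customer_id) of recent orders
    for ev in events:
        ev = enrich(ev)
        ts, cid = ev["timestamp"], ev["customer_id"]
        # drop orders that have fallen out of the sliding window
        while recent_orders and ts - recent_orders[0][0] > window_seconds:
            recent_orders.popleft()
        if ev["type"] == "order":
            recent_orders.append((ts, cid))
        elif ev["type"] == "payment_failed" and any(c == cid for _, c in recent_orders):
            yield {"alert": "failed_payment_after_order", **ev}

# Usage with a tiny synthetic event stream.
stream = [
    {"timestamp": 0, "type": "order", "customer_id": "c42"},
    {"timestamp": 30, "type": "payment_failed", "customer_id": "c42"},
]
for alert in detect_pattern(stream):
    print(alert)
```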
To achieve these objectives, data is processed on two levels. For real-time processing, complex event processing (CEP) techniques are used: pre-processed event data is enriched and merged with background information from company and web data sources. The bulk of the data, including the event data required for retrospective analyses, is integrated into Big Data warehouses on a scalable platform for comprehensive evaluations. The analysis results from the Big Data warehouse are fused with the real data from the running applications and checked for event patterns. The solutions developed are evaluated in a number of use cases and domains. The three main areas are:
Biomedical research is a very dynamically growing field of science, characterized by the massive use of new, highly data-intensive technologies. At the partner institutes of the planned center of excellence, there are two main focus areas: on the one hand, molecular analyses using the so-called "Omics" technologies, which aim primarily at the universal detection of genes (genomics), mRNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics) in a specific biological sample; on the other hand, processes based on image data. In addition, users are faced with the challenge of ever-growing data volumes. The transmission of these data over the existing data networks is no longer possible in every case, so it becomes necessary to organize decentralized pre-processing of the data and to establish efficient methods for data reduction directly at the data sources (a minimal sketch follows below). However, such decentralization makes the integrative analysis that is indispensable for knowledge extraction more difficult.
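How such a reduction directly at the data source might look can be indicated with a minimal, hypothetical sketch: instead of transmitting every raw measurement, only per-window summary statistics are passed on. The window size and the chosen statistics are illustrative assumptions, not an established pipeline of the partner institutes.

```python
# Minimal sketch of data reduction at the source: each window of raw samples is
# collapsed into a small summary record before transmission.
import statistics

def reduce_at_source(samples, window=1000):
    """Collapse each window of raw samples into a summary record."""
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        yield {
            "start_index": start,
            "n": len(chunk),
            "mean": statistics.fmean(chunk),
            "stdev": statistics.pstdev(chunk),
            "min": min(chunk),
            "max": max(chunk),
        }

# Usage: a synthetic signal of 10,000 raw values shrinks to 10 summary records.
raw = [float(i % 97) for i in range(10_000)]
summaries = list(reduce_at_source(raw))
print(len(summaries), summaries[0])
```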
In this application, images of biological systems are evaluated and complex semantic information is extracted from these data. The image data are recorded with different, partly self-developed microscopes. The researchers believe that these microscopes and the resulting data have the potential to reveal more about the genome-encoded units than any alternative approach. The vision is to open a window onto cellular development processes: each cell nucleus and/or cell in developing tissues and organisms should be monitored individually and, if desired, the quantity of labeled proteins quantified over time (see the sketch below). Parallel processing and distributed data storage are necessary due to the resulting data volumes alone.
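A minimal, hypothetical sketch of the per-frame quantification step is given below: nuclei are segmented by a simple threshold, labeled as connected regions, and their area and mean fluorescence are reported. Otsu thresholding and the synthetic test image are illustrative assumptions; the groups' actual segmentation and tracking pipelines are far more involved.

```python
# Minimal sketch of per-nucleus quantification in one time frame: threshold, label
# connected regions, and report area and mean fluorescence per labeled nucleus.
import numpy as np
from skimage import filters, measure

def quantify_nuclei(frame: np.ndarray):
    """Return one record per detected nucleus with its area and mean intensity."""
    binary = frame > filters.threshold_otsu(frame)   # separate foreground from background
    labels = measure.label(binary)                   # connected components = candidate nuclei
    records = []
    for region in measure.regionprops(labels, intensity_image=frame):
        records.append({
            "nucleus_id": region.label,
            "area_px": region.area,
            "mean_intensity": float(region.mean_intensity),
        })
    return records

# Usage: a synthetic frame with two bright blobs on a dark background.
frame = np.zeros((64, 64), dtype=float)
frame[10:20, 10:20] = 1.0
frame[40:55, 30:45] = 0.8
for rec in quantify_nuclei(frame):
    print(rec)
```

Tracking nuclei across frames and over time would build on such per-frame records, which is where the parallel processing and distributed storage mentioned above become necessary.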