Method Sciences

Scalable and Secure Data Platforms

Large volumes of data from heterogeneous sources and formats require user-friendly platforms that can be used flexibly and enable fast, secure processing. Specific user requirements are met with versatile, configurable hardware and software components from which customised solutions are assembled. During implementation, special emphasis is placed on parallel data processing as well as on acceleration through dedicated hardware and software optimisations. To select suitable components for user applications, and as a basis for optimisation, procedures for user-friendly performance evaluation are applied and further developed.

ScaDS accompanies and supports users from all disciplines in every step of overcoming their challenges in dealing with large amounts of data. The support ranges from simple recommendations and the provision of resources to consulting, concept development and implementation.

Big Data Integration and Analysis

While the term Big Data is mostly associated with the challenges and opportunities of today’s growth in data volume and velocity, it is also characterised by the increasing diversity of data. The spectrum of data sources ranges from sensor networks and protocol information from industrial machines to logs and clickstreams from increasingly complex software architectures and applications. In addition, there is a steady increase in commercial or publicly available data, such as data from social networks like Twitter or Open Data. More and more companies are looking to leverage all these existing types of data in their analytics projects to gain additional insights or enable new features in their products.

The need for a company-wide, integrated view of all relevant data has classically been met with relational data warehouse infrastructures. However, due to the necessary schema definitions as well as the rigid and controlled Extract-Transform-Load (ETL) processes that require well-defined input and target schemas, these infrastructures are not flexible enough to accommodate situational data of the most diverse structure. Apart from the technical challenges, it is often not even desirable to integrate all the data that accumulates in a Big Data landscape, as its future use cases are mostly unknown. This development towards agile and explorative data analysis has given rise to new principles of information management, such as data lake architectures, which aim to ingest data of any format in a simple way. Although this facilitates data transfer enormously, it only postpones the integration effort to a later point in time and makes it part of the actual analysis process. At the same time, the data integration aspect is usually the most time-consuming and expensive step in many data analysis projects. According to current studies, information workers and data scientists spend 50-80% of their time searching for and integrating data before the actual analysis can begin. Since accurate data integration is considered an "AI-complete" problem that generally requires validation by humans, automation of this task is not foreseeable.

For this reason, various new systems are being developed in ScaDS that rely on the analytical power of relational systems at their core, but extend them with additional capabilities to be able to use data from a wide variety of sources at query time:

1) DrillBeyond enables relationally structured data to be augmented with information from millions of web tables (Dresden Web Table Corpus).

2) FREDDY makes it possible to use unstructured data represented by word embeddings in the context of database systems, for example, to support comparisons and groupings of text values or queries based on the k-Nearest-Neighbor algorithm.

3) The DeExcelarator project deals with the extraction of relationally structured data from Excel spreadsheets. The particular difficulty is that the structuring of the data varies greatly from user to user. A number of machine learning approaches are therefore necessary to extract the correct information automatically.

DrillBeyond

Given a set of entities, such as countries, companies or people, and a corpus of partially structured data, Entity Augmentation Queries (EAQ) automatically return the values of an attribute that is also specified in the query, for example the turnover, the CEO or the share price of a company. This information can be found in web tables, among other places. Previous methods aggregate the candidate values and return exactly one value per entity. However, it is difficult for the user of such a system to understand how the individual values are composed over a multitude of data sources, which is very critical in the context of further analysis scenarios.
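To make this query type concrete, the following minimal sketch models an EAQ and the single-value-per-entity result of previous systems as Python types; all names are illustrative assumptions, not the actual DrillBeyond interface.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical shapes, not the actual DrillBeyond interface: an EAQ names a
# set of entities plus the attribute whose values should be looked up in the
# external corpus.
@dataclass
class EntityAugmentationQuery:
    entities: List[str]   # e.g. company names taken from a local relation
    attribute: str        # e.g. "turnover" or "CEO"

# Classical entity augmentation systems aggregate all candidate values found
# in the corpus (e.g. in matching web tables) into exactly one value per
# entity, which hides how the result is composed from individual sources.
SingleValueResult = Dict[str, float]   # entity -> one aggregated value
```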

In DrillBeyond, therefore, not just one result per entity set is generated, but a ranked list (top-k) of possible results. The user can inspect this list manually and thus verify the origin of the data. The main challenge is to generate consistent results. With a larger number of entities, it cannot be assumed that a single web table contains all the necessary information, such as the turnover of every requested company. Instead, the final result must be composed from several web tables, taking care, for example, not to mix up turnovers from different years or in different currencies. In summary, the following problem arises: given an EAQ consisting of a set of entities and a requested attribute, the Entity Augmentation System (EAS) should deliver a diversified top-k list of alternative results (augmentations) that are on the one hand relevant, but on the other hand also consistent and minimal. This objective can be mapped algorithmically to the weighted set cover problem, a weighted variant of set covering, one of Karp’s original 21 NP-complete problems. Intuitively, given a universe of elements U and a collection S of subsets of that universe, each associated with a weight, the objective is to select a sub-collection of S that covers all elements of U at minimum total cost. For this purpose, we develop different greedy algorithms as well as an approach based on evolutionary algorithms.
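The standard greedy heuristic for weighted set cover repeatedly picks the candidate set with the lowest cost per newly covered element until everything is covered. The following sketch illustrates that selection loop under simplified assumptions, with each candidate web table reduced to the set of entities it can augment and a single weight; the actual DrillBeyond algorithms additionally have to produce diversified top-k results and enforce consistency.

```python
from typing import Dict, List, Set, Tuple

def greedy_weighted_set_cover(
    universe: Set[str],                              # entities that must be augmented
    candidates: Dict[str, Tuple[Set[str], float]],   # table id -> (covered entities, weight)
) -> List[str]:
    """Greedily select table ids whose union covers the universe.

    Simplified sketch: in every round the candidate with the best ratio of
    weight to newly covered entities is chosen, the classic logarithmic
    approximation for weighted set cover.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        best_id, best_ratio = None, float("inf")
        for table_id, (covered, weight) in candidates.items():
            gain = len(covered & uncovered)
            if gain == 0:
                continue
            ratio = weight / gain
            if ratio < best_ratio:
                best_id, best_ratio = table_id, ratio
        if best_id is None:          # remaining entities cannot be covered at all
            break
        chosen.append(best_id)
        uncovered -= candidates[best_id][0]
    return chosen

# Toy example: two cheap tables together cover all entities more cheaply
# than one large, expensive table.
tables = {
    "web_table_1": ({"BMW", "Audi"}, 1.0),
    "web_table_2": ({"Audi", "Porsche"}, 0.5),
    "web_table_3": ({"BMW", "Audi", "Porsche"}, 2.0),
}
print(greedy_weighted_set_cover({"BMW", "Audi", "Porsche"}, tables))
# -> ['web_table_2', 'web_table_1']
```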

FREDDY

Word embeddings encode a number of semantic as well as syntactic properties of words or text passages in a high-dimensional vector and are therefore particularly useful in natural language processing (NLP) and information retrieval. To make the rich information stored in word embeddings usable for relational database systems, we propose FREDDY (Fast WoRd EmbedDings Database Systems), an extended relational database system based on PostgreSQL. We develop new query types that allow the user to analyse structured knowledge in database relations together with large unstructured text corpora encoded as word embeddings. Supported by various index structures and approximation techniques, these operations can perform fast similarity computations in high-dimensional vector spaces (typically 300 dimensions). A web application can be used to explore these novel query functions for different database schemas and different word embeddings.
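At the core of such query types is a nearest-neighbour search in the embedding space. The sketch below shows the principle as a brute-force cosine-similarity kNN lookup in Python/NumPy; it is only an illustration of the underlying computation, whereas FREDDY executes these operations inside PostgreSQL and accelerates them with index structures and approximation techniques.

```python
from typing import Dict, List, Tuple
import numpy as np

def knn_cosine(query_word: str,
               embeddings: Dict[str, np.ndarray],
               k: int = 5) -> List[Tuple[str, float]]:
    """Brute-force k-nearest-neighbour search by cosine similarity.

    `embeddings` maps words (for example, the values of a text column) to
    their pre-trained vectors, typically 300-dimensional.
    """
    q = embeddings[query_word]
    q = q / np.linalg.norm(q)
    scored: List[Tuple[str, float]] = []
    for word, vec in embeddings.items():
        if word == query_word:
            continue
        sim = float(np.dot(q, vec / np.linalg.norm(vec)))
        scored.append((word, sim))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```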

DeExcelarator

Spreadsheets are one of the most successful tools for creating content. Their easy handling and extensive functionality enable beginners and professionals alike to create, transform, analyse and visualise data. As a result, large amounts of information and knowledge are stored in this format, which calls for automated approaches to explore, interpret and reuse spreadsheet content. However, the high degree of freedom in using spreadsheet software leads to diverse and structurally very differently prepared data. Often the actual data is mixed with formatting, formulas, layout artefacts and other implicit information. Fully automated processing of arbitrary spreadsheets has therefore been difficult to achieve in the past, leaving human experts to perform a significant part of the task manually.
In the DeExcelarator project, we are mainly concerned with challenges related to the recognition of relational information in spreadsheets. For this purpose, we have developed a processing pipeline that first assigns each individual cell of a spreadsheet to a class such as "Data", "Header" or "Metadata" by means of a classifier. The individual cells are then grouped into larger ranges. Subsequently, evolutionary algorithms can be used to identify the ranges that together make up a table. All processing steps require training data, which we have generated ourselves on the basis of the ENRON corpus, a large collection of real-world business emails (including attached spreadsheets) that was made public during the Enron investigation and is widely used in research.
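As a rough illustration of the first pipeline stage, the sketch below trains a per-cell classifier on a handful of simple, hand-crafted cell features; the feature set, the input representation and the choice of a random forest are assumptions made for this example, not the exact configuration used in DeExcelarator.

```python
from typing import Dict, List
from sklearn.ensemble import RandomForestClassifier

def cell_features(cell: Dict) -> List[float]:
    """Turn one spreadsheet cell into a small numeric feature vector.

    Hypothetical features; the real pipeline derives a much richer set from
    content, formatting and the spatial context of a cell.
    """
    value = str(cell.get("value", ""))
    return [
        float(value.replace(".", "", 1).lstrip("-").isdigit()),  # looks numeric?
        float(bool(cell.get("bold", False))),                    # bold formatting?
        float(cell.get("row", 0)),                               # vertical position
        float(cell.get("col", 0)),                               # horizontal position
        float(len(value)),                                       # content length
    ]

def train_cell_classifier(cells: List[Dict], labels: List[str]) -> RandomForestClassifier:
    """Train a classifier that assigns each cell a label such as
    'Data', 'Header' or 'Metadata' (the first stage of the pipeline)."""
    features = [cell_features(c) for c in cells]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(features, labels)
    return clf
```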

Visual Analysis

Big data reaches not only the limits of predictability but above all those of comprehensibility. To enable faster understanding, three main areas of Visual Analytics are explored. First, methods are developed that integrate the user more deeply into the visualisation; to this end, novel immersive interaction methods in front of large display walls as well as virtual and augmented reality techniques are investigated. Second, semi-automatic methods are developed that support the user during interaction by adapting filters and other visualisation parameters. In addition, strategies are implemented to divide the data into parts that are easier to understand, especially in the area of segmentation; machine learning methods are used for this purpose. The third focus is the development of adapted hierarchical visualisations in the field of life sciences. Here, too, coping with large amounts of data is a major challenge. For this reason, promising and already successfully used tools are being expanded into a multi-level visualisation and interaction workflow. The insights gained are integrated into a more comprehensible visual analysis, and the workflow is extended with new data types.
