Big Data Integration and Analysis
While the term Big Data is mostly associated with the challenges and opportunities of today's growth in data volume and velocity, it is also characterised by the increasing diversity of data. The spectrum of data sources ranges from sensor networks and protocol data from industrial machines to logs and clickstreams from increasingly complex software architectures and applications. In addition, there is a steady increase in commercially or publicly available data, such as data from social networks like Twitter or Open Data initiatives. More and more companies seek to leverage all these types of data in their analytics projects to gain additional insights or to enable new features in their products.
The need for a company-wide, integrated view of all relevant data has classically been met with relational data warehouse infrastructures. However, due to the necessary schema definitions and the rigid, controlled Extract-Transform-Load (ETL) processes that require well-defined input and target schemas, these infrastructures are not flexible enough to accommodate ad hoc data of highly diverse structure. Beyond the technical challenges, it is often not even desirable to integrate all the data that accumulates in a Big Data landscape, as most of its future use cases are unknown. This trend towards agile and explorative data analysis has given rise to new principles of information management, such as data lake architectures, which aim to ingest data of any format in a simple way. Although this greatly simplifies data transfer, it merely postpones the integration effort to a later point in time and makes it part of the actual analysis process. At the same time, data integration is usually the most time-consuming and expensive step in many data analysis projects: according to current studies, information workers and data scientists spend 50-80% of their time searching for and integrating data before the actual analysis can begin. Since accurate data integration is considered an "AI-complete" problem that generally requires human validation, full automation of this task is not foreseeable.
For this reason, several new systems are being developed in ScaDS that retain the analytical power of relational systems at their core but extend them with additional capabilities, so that data from a wide variety of sources can be used at query time (conceptual sketches of each system follow the list):
1) DrillBeyond enables relationally structured data to be augmented at query time with information from millions of web tables (the Dresden Web Table Corpus).
2) FREDDY makes it possible to use unstructured data, represented by word embeddings, within database systems, for example to support comparisons and groupings of text values or queries based on the k-Nearest-Neighbor algorithm.
3) The DeExcelarator project deals with the extraction of relationally structured data from Excel spreadsheets. The particular difficulty is that the way users structure their data varies greatly, so a number of machine learning approaches are needed to extract the correct information automatically.
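To illustrate the idea behind DrillBeyond, the following Python sketch augments a local relation with an attribute found in a toy web table corpus. The corpus contents, the matching heuristic, and all values are invented for illustration; the actual system ranks candidate tables from the millions in the Dresden Web Table Corpus.

```python
# Minimal sketch of DrillBeyond-style query augmentation. The corpus,
# table layout and matching heuristic are simplified assumptions; the
# real system ranks millions of candidate web tables.

# Tiny stand-in for a web table corpus: each table maps an attribute
# name to a dictionary of {entity: value} pairs. All values invented.
WEB_TABLES = [
    {"population": {"Dresden": 556000, "Leipzig": 601000}},
    {"gdp": {"Germany": 3.8}},
]

def find_augmenting_table(keys, wanted_attribute):
    """Return the first web table that contains the requested attribute
    and overlaps with the given key values."""
    for table in WEB_TABLES:
        if wanted_attribute not in table:
            continue
        covered = set(table[wanted_attribute]) & set(keys)
        if covered:  # a real system would rank by coverage and quality
            return table
    return None

# Local relation that lacks a 'population' attribute.
cities = ["Dresden", "Leipzig", "Chemnitz"]

match = find_augmenting_table(cities, "population")
if match:
    for city in cities:
        # Missing values stay open, as in an outer join.
        print(city, match["population"].get(city, "unknown"))
```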
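The next sketch illustrates the kind of operation FREDDY supports: a k-Nearest-Neighbor search over word embeddings to compare text values semantically. The toy vectors and the pure-Python search are assumptions made for brevity; FREDDY itself integrates pre-trained embeddings into a relational database system.

```python
# Minimal sketch of a k-Nearest-Neighbor search over word embeddings.
# The toy 3-dimensional vectors are assumptions; real systems use
# pre-trained embeddings (e.g. word2vec) with hundreds of dimensions.
import numpy as np

# Toy embedding table: text value -> vector.
EMBEDDINGS = {
    "beer":  np.array([0.9, 0.1, 0.0]),
    "wine":  np.array([0.8, 0.2, 0.1]),
    "pizza": np.array([0.1, 0.9, 0.2]),
    "pasta": np.array([0.2, 0.8, 0.3]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn(term, k=2):
    """Return the k text values most similar to 'term'."""
    query = EMBEDDINGS[term]
    scored = [(cosine(query, vec), word)
              for word, vec in EMBEDDINGS.items() if word != term]
    return sorted(scored, reverse=True)[:k]

# Semantic comparison of text values, usable e.g. for similarity
# joins or groupings inside SQL queries.
print(knn("beer"))  # 'wine' ranks first
```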
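Finally, a minimal sketch of the machine learning step behind DeExcelarator-style spreadsheet extraction: classifying cells (e.g. header vs. data) from simple features before assembling a relational table. The feature set, labels, and training data here are invented; the actual project uses much richer features (formatting, fonts, cell references) and larger models.

```python
# Minimal sketch of cell classification for spreadsheet extraction.
# Features, labels and training data are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

def cell_features(value, row, col):
    """Encode a spreadsheet cell as simple numeric features:
    whether it is numeric, its length, and its row/column position."""
    is_numeric = value.replace(".", "", 1).isdigit()
    return [int(is_numeric), len(value), row, col]

# Hand-labelled toy cells as (value, row, col, label); 0 = header, 1 = data.
train = [
    ("Year",   0, 0, 0), ("Revenue", 0, 1, 0),
    ("2019",   1, 0, 1), ("1200.5",  1, 1, 1),
    ("2020",   2, 0, 1), ("1500.0",  2, 1, 1),
]
X = [cell_features(v, r, c) for v, r, c, _ in train]
y = [label for *_, label in train]

clf = DecisionTreeClassifier().fit(X, y)

# Classify an unseen cell before assembling a relational table.
print(clf.predict([cell_features("2021", 3, 0)]))  # expected: data cell (1)
```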