21.05.
Abstract
The goal of this talk is to inform participants about two concrete and widely used data analytics techniques that are suitable for analysing ‘big data’ in scientific and engineering applications. After a brief introduction to the general approach of using machine learning, data mining, and statistical computing in data analytics, the talk will offer details on the ‘clustering’ technique, which partitions datasets into previously unknown subgroups (i.e. clusters). From the broad class of available methods, we focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which also enables the identification of outliers or interesting anomalies. A parallel and scalable DBSCAN implementation, based on MPI/OpenMP and the hierarchical data format (HDF), will be discussed in the context of interesting scientific datasets. The second technique that the talk will address is ‘classification’, in which groups of data already exist and new data is checked in order to understand to which existing group it belongs. As one of the best out-of-the-box methods for classification, the support vector machine (SVM) algorithm, including kernel methods, will be a focus. A parallel and scalable SVM implementation, based on MPI, will be described in detail using a couple of challenging scientific datasets and smart feature extraction methods. Both of these high performance computing algorithms will be compared with solutions based on a variety of high throughput computing techniques (e.g. map-reduce, Hadoop, Spark/MLlib) and serial approaches (e.g. R, Octave, Matlab, Weka, scikit-learn).
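For readers unfamiliar with DBSCAN, the idea of density-based clustering with noise can be illustrated by a minimal serial sketch in plain Python. This is not the parallel MPI/OpenMP implementation discussed in the talk; it is a simple quadratic-time illustration, and the function names (`region_query`, `dbscan`), the parameter names (`eps`, `min_pts`) and the toy data are assumptions chosen for this example only.

```python
from math import dist

NOISE = -1  # label used for outliers


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point too: keep expanding
    return labels


# Toy data (hypothetical): two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

In this toy run the two dense groups become clusters 0 and 1, while the isolated point is labelled as noise, which is the outlier detection property mentioned above. The repeated full scan in `region_query` is what makes this sketch unsuitable for large datasets and is the kind of cost that parallel and scalable implementations aim to reduce.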
Dr.-Ing. Morris Riedel
Dr.-Ing. Morris Riedel is an Adjunct Associate Professor at the School of Engineering and Natural Sciences of the University of Iceland. He received his PhD from the Karlsruhe Institute of Technology (KIT) and started his work in parallel and distributed systems in the field of scientific visualization and computational steering of e-science applications on large-scale HPC resources. He previously held various positions at the Juelich Supercomputing Centre in Germany. At this institute, he is also the head of a scientific research group focussed on “High Productivity Data Processing” as part of the Federated Systems and Data Division. Lectures he has given at universities such as the University of Iceland, the University of Applied Sciences of Cologne and the University of Technology Aachen (RWTH Aachen) include ‘High Performance Computing & Big Data’, ‘Statistical Data Mining’, ‘Handling of large datasets’ and ‘Scientific and Grid computing’. His current research focusses on ‘high productivity processing of big data’ in the context of scientific computing applications.
Location
Willersbau A317 (Zellescher Weg 12-14)