Tracing of Spark Communication with Score-P

Status: finished / Type of Thesis: Diploma Thesis / Location: Dresden

In the Big Data domain, JVM-based frameworks such as Apache Spark distribute the application workload across a cluster in order to process large-scale datasets. For fast and resource-efficient execution of such applications, it is important to find and optimize the program sections that limit execution speed. For performance analysis, the application code is extended with measurement code that captures timestamps at method entries and exits (instrumentation). A program trace or profile is created automatically while the application executes and can later be presented to the performance analyst.

Currently, the measurement captures method entries and exits as well as thread events, such as a parent thread creating a child thread. Because communication between processes can be a valuable source of performance insights, the measurement also needs to collect information about the messages sent between the processes of a Spark cluster. Such messages are of different types, for example status messages exchanged between the Spark master and the workers, and data-transfer messages sent from one executor to another.
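To make the idea of entry and exit instrumentation concrete, the following minimal sketch shows what a method might look like after measurement code has been added. It assumes a hypothetical in-memory recorder named `RegionTracer`; this is not the Score-P API, and the class `Workload` is only an illustrative stand-in for application code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory event recorder (an assumption, not the Score-P API). */
final class RegionTracer {
    enum Kind { ENTER, EXIT }

    /** One trace event: which region, which thread, and when. */
    static final class Event {
        final Kind kind;
        final String region;
        final long threadId;
        final long timestampNanos;

        Event(Kind kind, String region, long threadId, long timestampNanos) {
            this.kind = kind;
            this.region = region;
            this.threadId = threadId;
            this.timestampNanos = timestampNanos;
        }
    }

    private static final List<Event> EVENTS =
            Collections.synchronizedList(new ArrayList<>());

    static void enter(String region) { record(Kind.ENTER, region); }

    static void exit(String region) { record(Kind.EXIT, region); }

    private static void record(Kind kind, String region) {
        EVENTS.add(new Event(kind, region,
                Thread.currentThread().getId(), System.nanoTime()));
    }

    /** Returns a copy of all events recorded so far. */
    static List<Event> snapshot() {
        synchronized (EVENTS) {
            return new ArrayList<>(EVENTS);
        }
    }

    static void clear() { EVENTS.clear(); }
}

/** Illustrative application code after (manual) instrumentation. */
final class Workload {
    static long sum(int n) {
        RegionTracer.enter("Workload.sum");
        try {
            long total = 0;
            for (int i = 1; i <= n; i++) {
                total += i;
            }
            return total;
        } finally {
            // The exit event is recorded even if the method body throws.
            RegionTracer.exit("Workload.sum");
        }
    }
}
```

In an automatic setup, the `enter` and `exit` calls would be injected by an instrumentation tool rather than written by hand; the `try`/`finally` shape keeps entry and exit events paired.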

In this master's or diploma thesis, a method for identifying communication-related code regions in the Spark framework is to be investigated. In a second step, the identified regions are to be instrumented automatically so that, when executed, they record the timestamps, message types, and message sizes of the communication. The runtime overhead of the message tracing is to be evaluated. A measurement infrastructure based on Score-P (see: http://www.score-p.org), which offers instrumentation and filtering capabilities, is available.
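As a rough illustration of the data such message tracing might collect, the sketch below wraps a send operation and records its timestamps, message type, and message size. All names (`MessageTracer`, `tracedSend`, `MessageRecord`) are hypothetical, the transport is modeled as a plain `Consumer<byte[]>`, and the message type is an arbitrary label; none of this reflects actual Spark or Score-P interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical message tracer (an assumption, not a Spark or Score-P class). */
final class MessageTracer {

    /** One traced message: type label, payload size, and send interval. */
    static final class MessageRecord {
        final String messageType;
        final int sizeBytes;
        final long startNanos;
        final long endNanos;

        MessageRecord(String messageType, int sizeBytes,
                      long startNanos, long endNanos) {
            this.messageType = messageType;
            this.sizeBytes = sizeBytes;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
        }
    }

    private final List<MessageRecord> records = new ArrayList<>();

    /**
     * Hands the payload to the given transport and records a trace entry.
     * The transport stands in for whatever code actually sends the bytes.
     */
    void tracedSend(String messageType, byte[] payload,
                    Consumer<byte[]> transport) {
        long start = System.nanoTime();
        try {
            transport.accept(payload);
        } finally {
            long end = System.nanoTime();
            synchronized (records) {
                records.add(new MessageRecord(
                        messageType, payload.length, start, end));
            }
        }
    }

    /** Returns a copy of all records collected so far. */
    List<MessageRecord> records() {
        synchronized (records) {
            return new ArrayList<>(records);
        }
    }

    /** Sum of all traced payload sizes in bytes. */
    long totalBytes() {
        long total = 0;
        for (MessageRecord r : records()) {
            total += r.sizeBytes;
        }
        return total;
    }
}
```

Comparing the run time of a workload with and without such a wrapper is one simple way to approach the overhead evaluation; how the real instrumentation hooks into Spark's communication layer is exactly what the thesis is meant to determine.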

Envisioned Tasks

  1. Investigation of communication in Spark
  2. Investigation of instrumentation methods
  3. Implementation of the proposed instrumentation method
  4. Validation and analysis of the proposed solution
  5. Documentation of the implementation and results

Requirements

  • Knowledge of Java
  • Knowledge of Java instrumentation, e.g. with ASM (see: https://asm.ow2.io)
  • The thesis can be written in either German or English.

Funded by:
Funded by the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).
Funded by the Free State of Saxony (Freistaat Sachsen).