Apache Beam and the price of unification

Type of thesis: Master's thesis / Location: Leipzig / Status of thesis: Finished


Many Big Data technologies have emerged in recent years. Frameworks like Apache Spark and Apache Flink provide a strong analytical backbone for distributed processing. However, with cloud services and other technologies like Apache Apex on the rise, a unified model that allows the underlying execution engine to be exchanged is becoming increasingly relevant for industrial use.

Google stated:
“We firmly believe Apache Beam is the future of streaming and batch data processing. We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not by maximizing market share via API lock in. The most capable runners at this point in time are clearly Google Cloud Dataflow (for running on GCP) and Apache Flink (for on-premises and non-Google cloud), as detailed by the capability matrix recently published. But others are catching up, and the industry as whole is shifting towards supporting the semantics in the Beam Model. This will only lead to more good things for Apache Beam users down the road.”
(Source: https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective)


But what is the cost of this abstraction layer? Does Beam remove the need for engine-specific Flink or Spark code altogether, or is it still beneficial to translate Beam code into the native code of the execution engine after a prototype phase?
This thesis evaluates the strengths of Apache Beam by benchmarking it against common execution engines.


Dr. Eric Peukert (peukert@informatik.uni-leipzig.de)
Matthias Kricke (kricke@informatik.uni-leipzig.de)


Eric Peukert

Administration Director

Department of Computer Science

Universität Leipzig