Apache Spark and Apache Flink are two widely used Big Data analytics frameworks. Their APIs allow an application to be developed and tested on a local workstation and, later, to distribute the work across many computers without changing the application's source code once the workstation's resources are no longer sufficient. The course Big Data Processing on HPC focuses on this step from a local workstation to an HPC environment and shows how a typical Big Data analysis workflow can be organized there. Participants will be introduced to running a data pipeline and processing data with Apache Flink and Apache Spark, as well as managing the corresponding configurations in the HPC environment.
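To illustrate this idea, here is a minimal PySpark word-count sketch (an illustrative example only, not the course's sample application; the input path and application name are placeholders). The script contains no cluster-specific settings, so the same code can run on a laptop or on many nodes depending only on how it is submitted.

    # Minimal PySpark word count (illustrative sketch; input path and app name are placeholders).
    # The script itself contains no cluster-specific settings.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-example").getOrCreate()

    lines = spark.read.text("input.txt")          # placeholder input file
    counts = (
        lines.rdd
        .flatMap(lambda row: row.value.split())   # split each line into words
        .map(lambda word: (word, 1))              # emit (word, 1) pairs
        .reduceByKey(lambda a, b: a + b)          # sum the counts per word
    )
    for word, count in counts.take(10):
        print(word, count)

    spark.stop()

Locally such a script might be started with spark-submit --master "local[*]" wordcount.py; on a cluster only the --master URL and the resource options passed to spark-submit change, not the Python code.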
Course Details
Title: Big Data Processing on HPC
Speakers: Apurv Deepak Kulkarni, Pramod Baddam, Wenyu Zhang
Next Session: 26.10.2023, 10 a.m. – 3 p.m.
Target Group: Users who have a Big Data problem
Language: English
Format: Online Tutorial. The room link will be announced after registration.
Registration: https://events.scads.ai/e/bigdata_hpc
Participation is free of charge.
Agenda
- Introduction
- Distributed Computing with Big Data
- HPC Considerations
  - Data Space
  - Software
  - Hardware
- Big Data Framework Configuration (see the configuration sketch after the agenda)
  - Master/Worker
  - Parallelism
  - Memory
- Hands-On Session
- Conclusion/Supplementary
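As an illustration of the configuration topics listed in the agenda, the following PySpark sketch shows where master/worker, parallelism, and memory settings enter a Spark application. All values and host names are placeholders rather than the course's actual settings; on an HPC system they would typically be matched to the resources granted by the batch system.

    # Illustrative configuration sketch (placeholder values, not the course's settings).
    # Shows where master/worker, parallelism, and memory settings enter a Spark application.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("config-example")
        # Master URL of the Spark cluster started on the allocated nodes (placeholder host)
        .config("spark.master", "spark://node001:7077")
        # Worker/executor sizing
        .config("spark.executor.instances", "4")      # number of executors
        .config("spark.executor.cores", "8")          # cores per executor
        # Memory
        .config("spark.executor.memory", "16g")
        .config("spark.driver.memory", "4g")
        # Parallelism (default number of partitions for shuffles and parallelize)
        .config("spark.default.parallelism", "64")
        .getOrCreate()
    )

The same keys can equally be supplied on the spark-submit command line or in spark-defaults.conf instead of in the application code.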
Handouts
The course material (slides, sample application) will be made available to participants.
Prerequisites
It is recommended that participants have basic knowledge of Big Data frameworks (e.g. Apache Flink, Apache Spark). In addition, basic HPC knowledge is helpful but not required.
Learning Outcomes
Participants will be able to start and configure a Big Data cluster on an HPC system and run their own applications on it.
Do you have any questions about the tutorial Big Data Processing on HPC? Don’t hesitate to contact our team!
Check out the other training courses offered by ScaDS.AI Dresden/Leipzig.