Jim Dowling

Topic: Hopsworks: End to End ML Pipelines

Machine Learning (ML) pipelines are the fundamental building block for productionizing ML code. Building such pipelines with Big Data is a complex process that can demand significant engineering effort and incur high maintenance costs. The stages of an ML pipeline also need to be orchestrated, from data ingestion and transformation, through feature engineering, to model training, serving, and monitoring.

In this talk we will present key points on how to take your ML and deep learning (DL) pipelines to the next level using Hopsworks. Hopsworks is an open-source, UI-driven, horizontally scalable, multi-tenant platform for Big Data and AI built on the world’s fastest and most scalable Hadoop distribution, Hops. First, we describe how to ingest and pre-process data in real time with technologies such as Apache Kafka and Apache Spark. Then we introduce the Feature Store, a central vault for storing documented and curated features, which provides automatic feature analysis and monitoring, feature sharing across models and teams, feature discovery, feature backfilling, and feature versioning. We will show how to conduct and reproduce large-scale ML experiments with Hopsworks’ Experiments service, how to manage Python environments with Conda, and how to write Python programs in Jupyter notebooks. We will demonstrate how to scale out deep learning on a Hopsworks cluster with GPUs managed as a resource, describe how models are elastically served and monitored in production with Kubernetes, and show how models can be analyzed and visualized with Apache Beam and TensorFlow Extended (TFX). Lastly, we show how to manage the entire lifecycle of the pipeline with Apache Airflow, a platform to programmatically author, schedule, and monitor workflows. During the talk, we will also share our experiences running Hopsworks on a cluster in Sweden with more than 1,000 users.