Saturday, January 29 • 10:30am - 10:55am
Building Petabyte Scale ML Models with Python

Building ML models on small or toy datasets is easy, but most production-grade problems involve massive datasets that common ML practices do not scale to. In this talk, we cover how you can drastically increase the amount of data your models can learn from by using distributed data and ML pipelines.

It can be difficult to figure out how to work with datasets that are too large to fit in RAM, even if you’re already comfortable with ML libraries and APIs in Python. Many questions immediately come up: Which library should I use, and why? What’s the difference between a “map-reduce” and a “task-graph”? What’s a partial fit function, and what format does it expect the data in? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on…
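To make one of those questions concrete, here is a toy, standard-library-only illustration of the two execution styles. It is a sketch under stated assumptions, not how any particular framework implements them: the `chunks` list stands in for partitions stored on disk, and `run`, `map_chunk`, `combine`, and the task names are hypothetical helpers invented for this example. Map-reduce applies one function to every chunk and folds the results together; a task graph names each step and its dependencies, so later steps (here, a variance that needs the mean first) can depend on earlier ones in arbitrary ways.

```python
from functools import partial, reduce

# Stand-in for partitions of a dataset that would normally live on disk.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]


# --- Map-reduce: the same map on every chunk, then one associative reduce ---
def map_chunk(chunk):
    """Summarise one chunk as (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Merge two (sum, count) summaries."""
    return (a[0] + b[0], a[1] + b[1])


total, count = reduce(combine, map(map_chunk, chunks))
mapreduce_mean = total / count


# --- Task graph: named tasks, each listing the tasks it depends on ---
def mean_from_parts(*parts):
    part_total, part_count = reduce(combine, parts)
    return part_total / part_count


def squared_deviation(chunk, mean):
    return sum((x - mean) ** 2 for x in chunk)


def variance_from_parts(*deviations):
    return sum(deviations) / sum(len(c) for c in chunks)


graph = {}
for i, chunk in enumerate(chunks):
    graph[f"stats-{i}"] = (partial(map_chunk, chunk), [])
    graph[f"sq-dev-{i}"] = (partial(squared_deviation, chunk), ["mean"])
graph["mean"] = (mean_from_parts, [f"stats-{i}" for i in range(len(chunks))])
graph["variance"] = (variance_from_parts, [f"sq-dev-{i}" for i in range(len(chunks))])


def run(graph, key, cache=None):
    """Evaluate one task, computing each dependency at most once."""
    cache = {} if cache is None else cache
    if key not in cache:
        func, deps = graph[key]
        cache[key] = func(*(run(graph, dep, cache) for dep in deps))
    return cache[key]


variance = run(graph, "variance")
print(mapreduce_mean, variance)
```

In this sketch the variance needs two passes over the chunks with a dependency between them, which is awkward to express as a single map followed by a single reduce but falls out naturally from the graph.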

In this talk, we’ll answer all those questions, and more!

We’ll start by walking through the current distributed analytics (out-of-core learning) landscape to understand the pain points and some of the available solutions.

Here is a sketch of a system designed to achieve this goal (of building scalable ML models):

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm
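
The three components above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the tiny in-memory corpus, the `N_FEATURES` width, and the names `stream_instances`, `extract_features`, `mini_batches`, and `OnlineLogisticRegression` are all hypothetical. It uses feature hashing so that no vocabulary has to be held in memory, and a logistic regression updated by stochastic gradient descent whose `partial_fit` method sees one mini-batch at a time.

```python
import math
import zlib

N_FEATURES = 2 ** 18  # assumed fixed width of the hashed feature space


def stream_instances():
    """1. Stream instances; a real stream would read from disk or the network."""
    corpus = [
        ("cheap pills buy now", 1),
        ("meeting agenda for monday", 0),
        ("buy cheap watches now", 1),
        ("project status and agenda", 0),
    ]
    for _ in range(50):
        yield from corpus


def extract_features(text):
    """2. Hash each token to a column index, giving a sparse {index: count} dict."""
    features = {}
    for token in text.lower().split():
        index = zlib.crc32(token.encode("utf-8")) % N_FEATURES
        features[index] = features.get(index, 0.0) + 1.0
    return features


class OnlineLogisticRegression:
    """3. Incremental algorithm: logistic regression trained by SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(self.weights[i] * v for i, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, batch):
        """Update the weights from one mini-batch of (features, label) pairs."""
        for features, label in batch:
            error = self.predict_proba(features) - label
            for i, v in features.items():
                self.weights[i] -= self.learning_rate * error * v
            self.bias -= self.learning_rate * error


def mini_batches(stream, size):
    """Group a stream into lists of at most `size` items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


model = OnlineLogisticRegression(N_FEATURES)
for batch in mini_batches(stream_instances(), size=8):
    model.partial_fit([(extract_features(text), label) for text, label in batch])

spam_score = model.predict_proba(extract_features("buy cheap pills"))
ham_score = model.predict_proba(extract_features("monday project meeting"))
print(spam_score, ham_score)
```

Only one mini-batch of raw instances is held in memory at any time, so the same loop shape applies whether the stream yields two hundred instances or far more than fit in RAM.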

Then we’ll read a large dataset into Dask, TensorFlow (tf.data) and sklearn streaming, and immediately apply what we learned in the previous section. We’ll move on to the model building process, including a discussion of which model is most appropriate for the task. We’ll evaluate our model in a few different ways, and then examine the model for greater insight into how the data is influencing its predictions. Finally, we’ll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.

Detailed Outline

1. Intro to out-of-core learning
2. Representing large datasets as instances
3. Transforming data (in batches) – live code [3-5]
4. Feature Engineering & Scaling
5. Building and evaluating a model (on entire datasets)
6. Practicing this workflow on another dataset
7. Benchmarking other libraries for out-of-core (OOC) learning
8. Questions and Answers

Key takeaway

By the end of the talk, participants will know how to build petabyte-scale ML models, beyond the limits of conventional Python libraries.

Participants will also have benchmarks and best practices for building such ML models at scale.

Session chairs: Justin Nixon and Michal Ruprich

Vaibhav Srivastav

Data Scientist, Deloitte Consulting LLP
Hi! I am a Data Scientist working with Deloitte Consulting LLP. I work with Fortune Technology 10 clients to help them make data-driven (profitable) decisions. In my spare time I serve as a Subject Matter Expert on Google Cloud Platform to help build scalable, resilient and fault...

Saturday January 29, 2022 10:30am - 10:55am CET
Session Room 4