Scalable Machine Learning Pipelines with Dask

Date: Friday, April 5, 2019
Time: 1:40pm–2:20pm

Dask is a powerful library within the PyData ecosystem. There are a number of great resources on Dask, from documentation and blog posts to tutorial videos. However, there is not yet a comprehensive resource on applying Dask specifically to machine learning pipelines. This talk aims to fill that gap.

Dask is useful at various stages of a machine learning pipeline, from data preprocessing to hyperparameter tuning. We show some of these use cases in the attached diagram. We will present a unified approach to thinking through the applications of Dask in ML workflows, one that can help you build scalable ML pipelines. We will focus on a case study where the goal is to classify journal papers into topic categories.

Key audience takeaways will include:

  1. How to identify challenges in machine learning that can be addressed using Dask
  2. A set of design patterns for applying Dask to machine learning workflows
  3. A set of examples with code, taken from real-world applications
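
As a rough illustration of the kind of pattern the talk covers, here is a minimal sketch that uses dask.delayed for both a preprocessing stage and a small hyperparameter search. It is not taken from the talk: the toy corpus, the token-overlap classifier, and the names PAPERS, preprocess, predict, evaluate, and min_len are all assumptions made up for this example.

```python
import dask
from dask import delayed

# Hypothetical toy corpus of (abstract text, topic label) pairs.
PAPERS = [
    ("Deep neural networks for image recognition", "cs"),
    ("Gene expression in yeast cells", "bio"),
    ("Convolutional networks and image segmentation", "cs"),
    ("Protein folding and gene regulation", "bio"),
]


@delayed
def preprocess(text, label):
    """Preprocessing stage: lowercase and tokenize one abstract."""
    return text.lower().split(), label


def predict(query_tokens, train_docs, min_len):
    """Toy classifier: label of the training doc sharing the most tokens."""
    query = {t for t in query_tokens if len(t) >= min_len}
    best_label, best_overlap = None, -1
    for tokens, label in train_docs:
        overlap = len(query & {t for t in tokens if len(t) >= min_len})
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


@delayed
def evaluate(docs, min_len):
    """Tuning stage: leave-one-out accuracy for one value of min_len."""
    correct = 0
    for i, (tokens, label) in enumerate(docs):
        train_docs = docs[:i] + docs[i + 1:]
        correct += predict(tokens, train_docs, min_len) == label
    return correct / len(docs)


# Build the task graph lazily: one preprocessing task per paper,
# then one evaluation task per hyperparameter value.
docs = [preprocess(text, label) for text, label in PAPERS]
grid = [1, 4, 8]  # candidate minimum token lengths (assumed values)
tasks = [evaluate(docs, min_len) for min_len in grid]

# Run the whole graph at once; the preprocessing results are shared
# by all evaluation tasks instead of being recomputed per candidate.
scores = dask.compute(*tasks, scheduler="threads")
results = dict(zip(grid, scores))
print(results)
```

Nothing runs until dask.compute is called, so the same graph could in principle be handed to a different scheduler without changing the functions themselves.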

Update: You can now view video of this presentation on YouTube.