Manifold Blog

Using Dask in Machine Learning: Best Practices

Posted by Jason Carpenter on Jan 31, 2019 6:00:00 AM

Introduction

The Python ecosystem offers a number of incredibly useful open source tools for data scientists and machine learning (ML) practitioners. One such tool is Dask, an open source parallel computing library originally developed at Anaconda. At Manifold, we have used Dask extensively to build scalable ML pipelines.

Thanks to this experience, we've identified design patterns for using Dask most effectively. This post is the first in a series aimed at sharing these design patterns; the rest of the series will include:

  • Dask for ML Workflows: Preprocessing
  • Dask for ML Workflows: Feature Engineering
  • Dask for ML Workflows: Modeling and Hyper-Parameter Optimization
  • Dask for ML Workflows: Cross-Validation

Throughout the series, we will also cover anti-patterns to avoid and cases where you shouldn't use Dask. Let's start with an overview of how Dask works and how we use it.

Dask in a nutshell

If you've ever worked with data at scale before, you know that many issues quickly arise even when trying to do relatively simple operations. Two important issues that frequently come up are:

1) Your dataset does not fit in the memory of a single machine.
2) Your data processing task takes a long time, and you would like to speed it up.

Dask addresses both of these issues. It is an easy-to-use, out-of-core parallelization library that integrates seamlessly with existing NumPy and pandas data structures.

What does it mean to do out-of-core parallel processing? First, let's look at the out-of-core part. Out-of-core processing means that data is read into memory from disk on an as-needed basis. This drastically reduces RAM usage while your code runs, and makes it far less likely that you will hit an out-of-memory error. The parallel processing part is straightforward: Dask can orchestrate parallel threads or processes for us and help speed up processing times.

The diagram below describes this all at a high level. Let's say we want to perform an operation on our larger-than-memory dataset. If the operation can be broken down into a sequence of operations on smaller partitions of our data, we can achieve our goal without having to read the whole dataset into memory. Dask reads each partition as it is needed and computes the intermediate results. The intermediate results are then aggregated into the final result. Depending on the specific operation, there may be many layers of intermediate results before we get our final result. Dask handles all of that sequencing internally for us. On a single machine, Dask can use threads or processes to parallelize these operations. Dask also provides a distributed scheduler that works on Kubernetes clusters.
[Figure: ML pipelines - Dask on a single machine]
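The chunked computation and scheduler choice described above can be sketched with a Dask Array (the numbers here are illustrative, not from the post):

```python
import dask.array as da

# A one-million-element array split into 4 chunks; each chunk-level sum
# is an independent task whose intermediate result Dask aggregates.
x = da.ones(1_000_000, chunks=250_000)

# Build the task graph: per-chunk partial sums, then a final aggregation.
total = x.sum()

# The scheduler is chosen at compute time: "threads", "processes",
# or "synchronous" (single-threaded, handy for debugging).
print(total.compute(scheduler="threads"))
```

Swapping scheduler="threads" for scheduler="processes" changes how the same graph is executed without touching the computation itself.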
On the parallelism front, Dask provides a few high-level constructs: Dask Bags, Dask DataFrames, and Dask Arrays. These constructs provide an easy-to-use interface for parallelizing many of the typical data transformations in ML workflows. Furthermore, Dask lets us create highly customized job execution graphs through its extensive Python API (e.g., dask.delayed) and its integration with existing data structures.
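As a small sketch of a customized execution graph built with dask.delayed (the load/clean/combine functions here are hypothetical stand-ins, not part of any real pipeline):

```python
from dask import delayed

@delayed
def load(part):
    # Stand-in for reading one partition of data from disk.
    return list(range(part * 10, part * 10 + 10))

@delayed
def clean(records):
    # Stand-in for a per-partition transformation.
    return [r for r in records if r % 2 == 0]

@delayed
def combine(parts):
    # Aggregate the per-partition results.
    return sum(len(p) for p in parts)

# Build the graph: load and clean each partition (these tasks can run in
# parallel), then combine. Nothing executes until .compute() is called.
cleaned = [clean(load(i)) for i in range(4)]
total = combine(cleaned)
print(total.compute())
```

Because each call returns a lazy Delayed object, Dask can see the full dependency structure and parallelize the independent load/clean branches automatically.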

Dask in Machine Learning workflows

The following diagram shows how Dask can be used throughout the ML workflow, from data processing to model evaluation. While there are other phases of ML such as exploratory data analysis and error analysis, these four represent the primary workflow of a practitioner. Each of our posts in this series will focus on one of the four phases of the machine learning process identified in this figure.

[Figure: ML pipelines - Dask use cases across the ML workflow]

To illustrate the efficacy of Dask, we will walk through a real-world project: classifying articles from the arXiv repository according to their subject categories, using only the abstracts as predictors.

Keep an eye out for future posts! And, in the meantime, here are some generally useful resources:

  1. An excellent tutorial by Matt Rocklin, the lead developer of Dask
  2. A longer tutorial with code
  3. An ML Docker image we created at Manifold with all the libraries we typically rely on, including Dask
  4. A similar Docker image targeted at deep learning, with Dask, PyTorch, TensorFlow, and more

Topics: Data science, Data engineering, Machine learning
