The Python ecosystem offers a number of incredibly useful open source tools for data scientists and machine learning (ML) practitioners. One such tool is Dask, available from Anaconda. At Manifold, we have used Dask extensively to build scalable ML pipelines.
Thanks to this experience, we've identified design patterns for using Dask most effectively. This post is the first in a series aimed at sharing these design patterns; the rest of the series will include:
- Dask for ML Workflows: Preprocessing
- Dask for ML Workflows: Feature Engineering
- Dask for ML Workflows: Modeling and Hyper-Parameter Optimization
- Dask for ML Workflows: Cross-Validation
Throughout the series, we will also cover anti-patterns to avoid and cases where you shouldn't use Dask at all. Let's start with an overview of how Dask works and how we use it.
Dask in a nutshell
If you've ever worked with data at scale before, you know that many issues quickly arise even when trying to do relatively simple operations. Two important issues that frequently come up are:
1) Your dataset does not fit in memory of a single machine
2) Your data processing task is taking a long time and you would like to optimize it
Dask addresses both of these issues. It is an easy-to-use, out-of-core parallelization library that integrates seamlessly with existing NumPy and pandas data structures.
What does it mean to do out-of-core parallel processing? First, let's look at the out-of-core part. Out-of-core processing means that data is read into memory from disk on an as-needed basis. This drastically reduces RAM usage while your code runs and makes it far less likely that you'll hit an out-of-memory error. The parallel processing part is straightforward: Dask can orchestrate parallel threads or processes for us and help speed up processing times.
The diagram below describes all of this at a high level. Say we want to perform an operation on our larger-than-memory dataset. If the operation can be broken down into a sequence of operations on smaller partitions of the data, we can achieve our goal without reading the whole dataset into memory. Dask reads each partition as it is needed and computes an intermediate result; the intermediate results are then aggregated into the final result. Depending on the specific operation, there may be many layers of intermediate results before we get our final result. Dask handles all of that sequencing internally for us. On a single machine, Dask can use threads or processes to parallelize these operations. Dask also provides a distributed scheduler that works on Kubernetes clusters.
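The partition-then-aggregate pattern in the diagram can be sketched with a Dask Array, where the chunk size plays the role of the partition. The array size and chunking here are arbitrary, chosen only to show that each chunk can be reduced independently before the results are combined.

```python
import numpy as np
import dask.array as da

# One million elements, split into chunks of 100,000. Each chunk's
# partial mean is an intermediate result; Dask aggregates them for us.
x = da.from_array(np.arange(1_000_000), chunks=100_000)

# Lazy until .compute() is called.
print(x.mean().compute())
```

With a multi-chunk array like this, Dask's scheduler is free to process the chunks in parallel threads or processes, exactly as the diagram suggests.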
On the parallelism front, Dask provides a few high-level constructs called Dask Bags, Dask DataFrames, and Dask Arrays. These constructs provide an easy-to-use interface for parallelizing many of the typical data transformations in ML workflows. Furthermore, with Dask we can create highly customized job execution graphs using its extensive Python API (e.g. dask.delayed) and its integration with existing data structures.
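As a rough illustration of building a custom execution graph with `dask.delayed`, the sketch below sums a list in partitions: one delayed task per partition produces an intermediate result, and a final delayed task aggregates them. The `partial_sum` helper and the partition size are made up for the example.

```python
import dask


@dask.delayed
def partial_sum(chunk):
    # Intermediate result for one partition of the data.
    return sum(chunk)


data = list(range(100))
# Four partitions of 25 elements each.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Build the graph: one task per partition, then an aggregation task.
partials = [partial_sum(c) for c in chunks]
total = dask.delayed(sum)(partials)

# Nothing runs until .compute(); Dask then schedules the four
# partial sums (potentially in parallel) and the final reduction.
print(total.compute())
```

This is the same partition-and-aggregate idea as before, but expressed explicitly, which is useful when a workflow doesn't map cleanly onto the Bag, DataFrame, or Array abstractions.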
Dask in Machine Learning workflows
The following diagram shows how Dask can be used throughout the ML workflow, from data processing to model evaluation. While there are other phases of ML such as exploratory data analysis and error analysis, these four represent the primary workflow of a practitioner. Each of our posts in this series will focus on one of the four phases of the machine learning process identified in this figure.
To illustrate the efficacy of Dask, we will walk through a real-world project: classifying articles from the arXiv repository according to their subject categories, using only the abstracts as predictors.
Keep an eye out for future posts! And, in the meantime, here are some generally useful resources:
- An excellent tutorial by Matt Rocklin, the lead developer of Dask
- A longer tutorial with code
- An ML Docker image we created at Manifold with all the libraries we typically rely on, including Dask
- A similar Docker image targeted at deep learning, with Dask, PyTorch, TensorFlow, and more