Manifold Blog

A Python Toolkit for Docker-First Data Science

Posted by Alexander Ng on Apr 19, 2018 7:00:00 AM

As interest in Artificial Intelligence (AI), and specifically Machine Learning (ML), grows and more engineers enter this popular field, the lack of de facto standards and frameworks for how work should be done is becoming more apparent. A new focus on optimizing the ML delivery pipeline is starting to gain momentum.

TL;DR

At Manifold, we developed some internal tools for easily spinning up Docker-based development environments for machine learning projects. We are open-sourcing them as part of an evolving toolkit called Orbyter. The Orbyter 1.0 package contains a Python cookiecutter template and a public Docker image. The goal of Orbyter is to help data science teams adopt Docker and apply Development Operations (DevOps) best practices to streamline machine learning delivery pipelines.

Data Science and the Delivery Pipeline

Data scientists are becoming more involved in the delivery pipeline of products, and ensuring that their work survives the delivery process is a non-trivial task. Of course, this isn’t a new problem: in the past, traditional software development teams would throw their work “over the wall” to the operations team to serve in production with little to no context. A community effort to solve the inevitable mess resulted in what we now think of as DevOps, which removes the wall between development and operations to increase efficiency and improve product quality. New tools and processes now help teams implement streamlined delivery pipelines and maintain development/production parity.

Now the same problem has reared its head in the ML space, and it is only getting worse as demand for AI products continues to grow. There is a new wall that is killing productivity. How does DevOps change with the rise of data science teams in engineering organizations? The pain points we are seeing in the community today feel familiar, but they also have aspects unique to ML development.

(This obviously needs a catchy name, so let’s call it… MachOps?)

Machop: A bluish-gray Pokémon with large arm muscles. Can be found in the wild building open-source ML productivity tools for data scientists and engineers.

Rise of the Machine Learning Engineer

Simply put, solutions in the DevOps space provide tools for people working at the intersection of development and operations. Similarly, there needs to be a toolkit for the person working at the intersection of data science and software engineering. We call that person a Machine Learning Engineer (MLE). At a high level, MLEs have the same set of challenges as any software engineer working in a product development team:

  1. Standardized local development environments
  2. Development vs. production environment parity
  3. Standardized packaging and deployment pipelines

In addition, certain aspects of the ML development workflow provide a different set of challenges to MLEs:

  1. Easily sharing development environments and intermediate results for conducting reproducible experiments
  2. Coordinating isolated project environments running multiple notebook servers
  3. Easily allowing for vertical and horizontal scaling to handle large datasets or leverage additional compute resources (e.g., for deep learning, optimization, etc.)

By looking through an ML development lens with the DevOps mentality, we can identify several new areas along the delivery path that need improving. There is an opportunity to build new tools and best practices that specifically empower the MLE community to deliver more robust solutions in less time.

Whatever the MLE toolkit ends up including, one thing we are very confident about is: Docker will play a major role in the ML development lifecycle standard.

Docker-First Data Science

By moving to a Docker-first workflow, MLEs gain significant downstream advantages in the development lifecycle: easy vertical and horizontal scaling for running workloads on large datasets, and easier deployment and delivery of models and prediction engines. Containers launched from the same Docker image provide an easy way to guarantee a consistent runtime environment across developer laptops, remote compute clusters, and production environments.
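
To make "consistent runtime environment" concrete, here is a minimal sketch of how you might check parity by hand between two machines. It is not part of Orbyter; the function name `environment_fingerprint` and the package list are hypothetical, chosen only for illustration. It hashes the Python version together with the installed versions of the packages you care about, so two environments that match produce the same digest.

```python
import hashlib
import platform
from importlib import metadata


def environment_fingerprint(package_names):
    """Return a SHA-256 hex digest describing this Python environment.

    The digest covers the interpreter version and the installed version
    of each package in `package_names` (a hypothetical, caller-chosen list).
    """
    lines = [f"python=={platform.python_version()}"]
    # Sort the names so the result does not depend on argument order.
    for name in sorted(package_names):
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # A missing package is itself a difference worth capturing.
            lines.append(f"{name}==<not installed>")
    payload = "\n".join(lines)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Running `environment_fingerprint(["numpy", "pandas"])` on a laptop and on a remote machine and comparing the two strings would reveal drift between them. With a shared Docker image, this kind of check should pass by construction, which is the point of the Docker-first approach.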

While this same consistency can be achieved with careful use of virtual environments and disciplined system-level configuration management, containers still provide a significant advantage in terms of spin up/down time for new environments and developer productivity. However, what we have heard repeatedly from the data science community is: “I know Docker will make this easier, but I don’t have the time or resources to set it up and figure it all out.”

At Manifold, we developed internal tools for easily spinning up Docker-based development environments for machine learning projects. In order to help other data science teams adopt Docker and apply DevOps best practices to streamline machine learning delivery pipelines, we open-sourced our evolving toolkit. We wanted to make it dead simple for teams to spin up new ready-to-go development environments and move to a Docker-first workflow.

How Does it Work?

The Orbyter 1.0 package contains a Dockerized Cookiecutter for Data Science (a fork of the popular cookiecutter-data-science) and an ML Development Base Docker Image. Using the project cookiecutter and Docker image together, you can go from a cold start to working on a new project in a Jupyter Notebook, with all of the common libraries available, in under five minutes (and without having to pip install anything).

After instantiating a new project with the cookiecutter template and running a single start command, your local development setup will look like this:

[Figure: Fully configured out-of-the-box Dockerized local development setup for data science projects.]

Let’s dive a little deeper into what’s happening here:

  1. The ML base development image is pulled down to your local machine from Docker Hub. It comes with many of the commonly used data science and ML libraries pre-installed, along with a Jupyter Notebook server with useful extensions installed and configured.
  2. A container is launched with the base image, and is configured to mount your top-level project directory as a shared volume on the container. This lets you use your preferred IDE on your host machine to modify code and see changes reflected immediately in the runtime environment.
  3. Port forwarding is set up so you can use a browser on your host machine to work with the notebook server running inside the container. An appropriate host port to forward is dynamically chosen, so no worries about port conflicts (e.g., other notebook servers, databases, or anything else running on your laptop).
  4. The project is scaffolded with its own Dockerfile, so you can install any project-specific packages or libraries and share your environment with the team via source control.
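
The volume mount and port forwarding steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Orbyter's actual start script: the image name `example/ml-dev-base:latest` and the mount point `/mnt` are hypothetical placeholders, and `8888` is simply Jupyter's default port. The sketch asks the operating system for a free host port, then assembles (without executing) a `docker run` command that mounts the project directory and forwards that port to the notebook server.

```python
import socket
from pathlib import Path


def find_free_port():
    """Ask the OS for a TCP port that is currently unused on the host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


def build_run_command(project_dir,
                      image="example/ml-dev-base:latest",  # hypothetical image
                      container_port=8888):  # Jupyter's default port
    """Build, but do not execute, a `docker run` argument list."""
    project_path = Path(project_dir).resolve()
    host_port = find_free_port()
    return [
        "docker", "run", "--rm", "--detach",
        # Share the project directory with the container (hypothetical /mnt).
        "--volume", f"{project_path}:/mnt",
        # Forward the chosen host port to the notebook server's port.
        "--publish", f"{host_port}:{container_port}",
        image,
    ]
```

Calling `build_run_command(".")` returns a list you could hand to `subprocess.run`. Note that there is a small window between finding the free port and Docker binding it, so a real tool would want to retry if another process claims the port in the meantime.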

You can use your favorite browser and IDE locally as you normally would to do your work, while your runtime environment is 100% consistent across your team. If you are working on multiple projects on your machine, rest assured that each project is running in its own cleanly isolated container.

We’re just getting started!

There are several tools and features that we are currently working on and will be adding to Orbyter releases in the very near future. Some of the functionality we are working on:

  1. Easy spin up/down of remote compute for ad-hoc vertical scaling to circumvent memory and compute constraints on your laptop.
  2. Easy spin up/down of a remote cluster to leverage distributed processing with Dask for feature engineering and training on an ad-hoc basis.
  3. Integration with popular BYOC (Bring Your Own Container) platforms to easily leverage the features they offer for training and highly scalable deployments (e.g., Amazon SageMaker).

Let us know if there are other use cases you would like to see added to the Orbyter toolkit!

Lay the Foundation

There is a lot of exciting activity going on in the MLE toolkit space, and it’s easy to forget that, before even considering a higher-order platform or framework, you need to make sure your team is set up for success. The ML delivery pipeline needs what the DevOps movement did for software engineering. Moving to a Docker-first development workflow is a great first step in making life easier for everyone involved with the delivery pipeline, and that includes your customers.

Editor's Note: This blog post was originally published on Medium. It has been condensed and edited for this space. An even more condensed version also appeared on KDNuggets.

This post has also been updated to reflect the current project name. Since initial publication, the project has been renamed—previously called Torus, it is now called Orbyter.

Topics: Data engineering, MachOps, Orbyter
