Streamlining a Machine Learning Project Team
Date: Tuesday, March 26, 2019
Location: Room 2005
Artificial Intelligence is already helping many businesses become more responsive and competitive, but how do you move machine learning models efficiently from research to deployment at enterprise scale? It is imperative to plan for deployment from day one, both in tool selection and in the feedback and development process.
As recently as a few years ago, data scientists played in a sandbox: when they came up with a useful model, it was thrown over the wall to another team that would reimplement it for production. Those days are over: there’s only one Git repo in the entire company, and everything you commit is essentially in production. Yet teams are still run as if data science were mainly about experimentation.
This tutorial presents best practices for working in this new reality. Data scientists can still play in a sandbox, but in a way that makes it turnkey to take models into production. Just as DevOps is about people working at the intersection of development and operations, there are now people working at the intersection of data science and software engineering who need to be integrated into the team with tools and support. At Manifold, we’ve developed the Lean AI process and the open-source Orbyter package for Docker-first data science to help do just that.
Sourav Dey and Alex Ng explain how to streamline a machine learning project and help your engineers work as an integrated part of your development and production teams.
- Understanding both the business problem and the data
- Containerized data science for cleaner workflows
- Data engineering as a core competency
- Building iterative data models to deliver value early
- Best practices for bookkeeping ML experiments
- Developing user trust in the data models
- Seamless deployment at production scale
- Observing and validating on-the-ground model use
Applications of Mixed Effect Random Forests
Location: Room 2011
Clustered data is all around us. The most common example we see is longitudinal clustering, where each individual instance of a phenomenon you wish to model has multiple associated measurements. For example, say we want to model math test scores as a function of sleep factors, but we have multiple measurements per student. Another common example is clustering due to a categorical variable, such as clusters representing the specific math teacher of a group of students. Clustering can therefore also be hierarchical: a student cluster is contained within a teacher cluster, which is in turn contained within a school cluster. When modeling clustered data, we want to account for any idiosyncrasies and non-negligible random effects by cluster.
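To make the student example concrete, here is a minimal sketch in plain Python. It is not the MERF package: it uses the simplest mixed effect setup, a random intercept per student, and every number in it (60 points baseline, 3 points per hour of sleep, the noise levels, 30 students with 8 tests each) is a made-up assumption for illustration. A pooled line fit ignores the clusters; the per-student effect is then estimated by shrinking each student's mean residual toward zero.

```python
import random
from statistics import mean, variance


def simulate_scores(n_students=30, n_tests=8, seed=0):
    """Simulate (student, sleep_hours, score) rows with a per-student random effect."""
    rng = random.Random(seed)
    rows = []
    for student in range(n_students):
        student_effect = rng.gauss(0.0, 5.0)  # cluster-level random intercept
        for _ in range(n_tests):
            sleep = rng.uniform(5.0, 9.0)
            score = 60.0 + 3.0 * sleep + student_effect + rng.gauss(0.0, 2.0)
            rows.append((student, sleep, score))
    return rows


def fit_pooled_line(rows):
    """Ordinary least squares of score on sleep, ignoring which student each row belongs to."""
    xs = [sleep for _, sleep, _ in rows]
    ys = [score for _, _, score in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals_by_student(rows, intercept, slope):
    """Group the pooled-fit residuals by student."""
    resid = {}
    for student, sleep, score in rows:
        resid.setdefault(student, []).append(score - (intercept + slope * sleep))
    return resid


def shrunken_student_effects(resid):
    """Estimate each student's random intercept by shrinking the mean residual toward zero.

    Variance components come from simple moment estimates (balanced design assumed).
    """
    n = mean(len(r) for r in resid.values())
    var_within = mean(variance(r) for r in resid.values())
    var_between = max(0.0, variance([mean(r) for r in resid.values()]) - var_within / n)
    effects = {}
    for student, r in resid.items():
        weight = var_between / (var_between + var_within / len(r))
        effects[student] = weight * mean(r)
    return effects


def mean_squared_error(resid, effects=None):
    """In-sample squared error, optionally after subtracting the estimated student effects."""
    errors = [
        (value - (effects[student] if effects is not None else 0.0)) ** 2
        for student, values in resid.items()
        for value in values
    ]
    return mean(errors)


if __name__ == "__main__":
    rows = simulate_scores()
    intercept, slope = fit_pooled_line(rows)
    resid = residuals_by_student(rows, intercept, slope)
    effects = shrunken_student_effects(resid)
    print("pooled MSE:", mean_squared_error(resid))
    print("MSE with student effects:", mean_squared_error(resid, effects))
```

The shrinkage weight is the design choice that matters here: a student with few or noisy measurements gets an effect pulled closer to zero, while a student with many consistent measurements keeps most of their mean residual.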
The best way to attack this kind of data? Mixed effect models. Inspired by the models we have been building for clients, we at Manifold have developed an open-source Python package: Mixed Effects Random Forests (MERF).
In this talk, Sourav will explain how the MERF model marries the world of classical mixed effect modeling with modern machine learning algorithms, and how it can be extended to be used with other advanced modeling techniques like gradient boosting machines and deep learning. He will also walk through example use cases, and demonstrate MERF performance on synthetic and real data.
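The abstract does not spell out the model, so the following is an assumption based on the formulation commonly used in the mixed effects literature. For cluster i, with fixed effect covariates X_i, random effect covariates Z_i, and responses y_i:

```latex
y_i = f(X_i) + Z_i b_i + \epsilon_i,
\qquad b_i \sim \mathcal{N}(0, D),
\qquad \epsilon_i \sim \mathcal{N}(0, R_i)
```

In a classical linear mixed effect model, f(X_i) = X_i \beta. Under the assumption above, a mixed effects random forest keeps the random effect term Z_i b_i and swaps the linear f for a random forest. The extension to gradient boosting machines or deep learning mentioned in the abstract would correspond to using a different learner for f.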
Sourav Dey, CTO