Clients come to us at many different project phases, but a significant number fall into two buckets:

  1. They have a solid use case in mind, but lack expertise in the data pipeline and in the data they actually need for machine learning (ML). They often haven't deployed a model before.
  2. They start from the business strategy side, saying “Our CEO has a five- or ten-year vision around transforming our core business and how we service our customers.” They're often in more traditionally industrial businesses like staffing or manufacturing. Software as a Service is a fairly new idea for people in this bucket. They're interested in some kind of move to a subscription model, but they usually have not identified a particular use case.

During these conversations, we focus on starting with the product strategy before moving on to building and delivering software or ML models. With artificial intelligence being such a buzzword, people often want to jump to the AI pieces, but it's important to start with relevant business strategy and go-to-market questions, because ML is not always the right answer. You also need to be sure you don't skip past data engineering gaps; the right infrastructure is key.

Build Your Foundation First

Most people are familiar with traditional software product development processes, design iteration, and so forth. Data and ML bring some different considerations. We have a process we call Lean AI, which incorporates a feedback loop among business understanding, data understanding, data engineering, and modeling.

Take a situation where you have a particular idea of what the ideal user experience might be. As we start to get into the data and try different modeling techniques, we might surface additional compelling actions the user could take in their workflow. Or the original experience as envisioned may have to change because there is not enough predictive power in the data, or because a data source that you thought you’d be able to get your hands on is just not going to be available.

In addition to exploring the data and what you have access to, there may also be more traditional software constraints. If it’s going to take six months to get a particular piece of data cleaned up enough that we can actually use it, is there something lighter weight that we could at least get started with, and then continue to refine and iterate over time? This science of what's possible given the signal inside the data is the biggest difference between traditional product development, which involves only software engineering in apps, and working with data and ML.

Don't Skip Data Engineering

The engineering part of data projects has, historically, been discussed less than writing algorithms and deploying models. Much of the relevant content generally focuses on how to get your first model going, or grabbing a particular data set and playing with it. In that sense, there are a lot of tools available for doing data science and data exploration.

Of course, it's great if someone builds a model that's interesting. However, if at the end of the day no one is pushing a button differently because of your model, or pulling the lever differently, it doesn't matter that you built it in the first place. Having the right engineering and product development expertise, as opposed to solely data science expertise, will save you from making this mistake.

It may seem at first that this engineering and planning mindset slows things down, but the opposite is true. Tackling big risks early is one of the major themes of what we do at Manifold, and that's largely born of our engineering and product experiences. Think of it like doing paper mockups before you get to higher fidelity mocks. The similar idea in ML is, “Okay, get some basic data through your data pipeline. It doesn’t have to be perfect.”

Then we build a baseline model using basic techniques (like random forest) that make it really easy to understand what the model is doing. We can get some baseline of performance pretty quickly: does it perform at 60% or 80%? From there, we decide how much more of an investment we want to make, what other data we may need, etc. We can add functionality as we go.
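
The baseline step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Lean AI process itself: the data is synthetic, the names make_toy_data, fit_majority, and fit_stump are hypothetical, and a single-split decision stump stands in for a random forest so the example needs no external libraries. The point is the comparison: a trivial "always guess the most common label" predictor sets the floor, and a simple, easy-to-inspect model shows how much signal the data carries beyond that floor.

```python
import random


def make_toy_data(n=400, seed=0):
    """Hypothetical data: two numeric features and a binary label.

    The label depends on feature 0, with 15% of labels flipped as noise.
    Feature 1 is pure noise.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0, x1 = rng.random(), rng.random()
        flipped = rng.random() < 0.15
        label = int(x0 > 0.5) ^ int(flipped)
        rows.append(((x0, x1), label))
    return rows


def accuracy(predict, rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_majority(rows):
    """Trivial baseline: always predict the most common training label."""
    ones = sum(y for _, y in rows)
    majority = int(ones * 2 >= len(rows))
    return lambda x: majority


def fit_stump(rows):
    """Simple model: the one feature/threshold split with the best training accuracy."""
    best_acc, best_predict = -1.0, None
    n_features = len(rows[0][0])
    for f in range(n_features):
        for t in [i / 20 for i in range(1, 20)]:
            for flip in (0, 1):
                # Bind loop variables as defaults so each lambda keeps its own split.
                predict = lambda x, f=f, t=t, flip=flip: int(x[f] > t) ^ flip
                acc = accuracy(predict, rows)
                if acc > best_acc:
                    best_acc, best_predict = acc, predict
    return best_predict


rows = make_toy_data()
train, holdout = rows[:300], rows[300:]  # evaluate on rows the model never saw

majority_acc = accuracy(fit_majority(train), holdout)
stump_acc = accuracy(fit_stump(train), holdout)

print(f"majority-class baseline: {majority_acc:.2f}")
print(f"one-split stump:         {stump_acc:.2f}")
```

The gap between the two numbers, rather than either number alone, is what informs the decision about further investment. In a real project, a library implementation of random forest would take the stump's place, and the holdout data would come from the actual pipeline.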

In Other Words

Have a holistic view of both your project and your organization; start thinking about an 18-month timeframe:
  • Do you want to build a software engineering organization?
  • Do you want to build a data engineering capability?
  • Do you want to have a data science team?
  • Do you want to work with the finance team to get a couple of business analysts over to a new team?

Be sure to share this view with your organization. The people building your models or data pipeline, for example, will find it helpful to have the broader context as opposed to a very narrow window into, “I need these three fields to be cleaned up and available.” If you can't provide that broader context, you will end up with a lot of disjointed pieces.


This article is adapted from Vinay Seth Mohta's appearance on The Designing for Analytics podcast. Check out the full episode for more on how to approach machine learning projects in the enterprise.