Challenge

Our client designs, manufactures, and leases industrial equipment and provides software to remotely monitor equipment operations. They engaged Manifold to accelerate the prototyping of intelligent features in the software platform that could reduce machinery downtime by predicting and preventing faults.

Working closely with our client’s software engineering team, Manifold built a predictive analytics prototype that uses equipment sensor time-series data to predict and prevent machine downtime. The system alerts field teams about units at risk of faulting, so they can take proactive action before a failure occurs. The client’s team is able to continue the work on their own, maintain the code, and conduct further experiments using the data processing pipeline and machine learning framework we created.

Solution

In the ramp-up phase, Manifold met with the project’s business sponsor and the software development organization to define the project’s business and technology objectives more explicitly. We then helped our client translate those business objectives into a product specification.

Data Engineering at Scale

We audited our client’s data to better understand their data sources, quality, and resolution. The bulk of the ETL effort involved merging multiple data sources in varying formats from the client’s data lake. We devised a data engineering strategy that sourced terabytes of data—including real-time streaming sensor data, hardware-specific demographic information, human-generated maintenance reports, and external weather data.
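As a rough illustration of what this kind of multi-source merge can look like (the paths, schemas, and column names below are hypothetical, not the client’s actual data model), a PySpark job might standardize and join the sources like this:

    # Illustrative PySpark sketch of the multi-source merge; every path,
    # schema, and column name here is hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("equipment-etl").getOrCreate()

    # Streaming sensor data landed as Parquet: one row per (unit_id, timestamp).
    sensors = spark.read.parquet("s3://data-lake/sensors/")

    # Unit "demographics": static attributes such as engine type and pipe diameter.
    units = spark.read.csv("s3://data-lake/units.csv", header=True, inferSchema=True)

    # Human-generated maintenance reports exported as JSON.
    reports = spark.read.json("s3://data-lake/maintenance/")

    # Truncate timestamps to a common resolution before joining.
    sensors = sensors.withColumn("ts_hour", F.date_trunc("hour", F.col("timestamp")))

    merged = (
        sensors
        .join(units, on="unit_id", how="left")
        .join(reports.select("unit_id", "report_date", "report_text"),
              on="unit_id", how="left")
    )
    merged.write.mode("overwrite").parquet("s3://data-lake/merged/")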

Working with the client’s engineering team, we were able to understand when to segment and when to normalize the data based on varying applications (e.g., vapor recovery vs. fluid pumping) and heterogeneous conditions (e.g., electric vs. gas-powered engines, variable engine sizes, and variable pipe diameters). We used advanced data storage, compression, and collation techniques, built on Apache Spark and Dask, to combine continuous sensor data with demographic data from hundreds of units across three years. We encountered and mitigated data complexities such as data dropouts, inconsistent timing, variable human input, and missing data for different regions.
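The per-unit cleanup logic can be sketched in simplified form with pandas (the production pipeline distributed this work with Spark and Dask; the 1-minute grid and column names are illustrative assumptions):

    # Simplified pandas sketch of per-unit cleanup; sensor columns are assumed
    # numeric, and the 1-minute grid is an illustrative choice.
    import pandas as pd

    def clean_unit_series(df: pd.DataFrame) -> pd.DataFrame:
        """Regularize one unit's sensor stream onto a fixed time grid."""
        df = df.set_index("timestamp").sort_index()
        # Resample onto a uniform grid to fix inconsistent sensor timing.
        df = df.resample("1min").mean(numeric_only=True)
        # Forward-fill short dropouts; leave long gaps as NaN so they can be
        # flagged downstream rather than silently interpolated.
        return df.ffill(limit=5)

    def segment_by_application(units: pd.DataFrame) -> dict:
        """Split units by application so each segment gets its own
        normalization and, where needed, its own model."""
        return {app: grp for app, grp in units.groupby("application")}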

Manifold defined targets that captured the largest amount of business value, yet could be modeled with reasonable predictive accuracy. We initially tried to model rare events such as parts replacement and emergency work orders, but found that the human-generated data alone did not provide a clear enough target to model. Fortunately, we were able to couple this human data with alarm codes in the real-time sensor data to create a reliable synthetic target.
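In spirit, the synthetic target labeled a unit-day as positive only when the sensor alarms and the human-generated records agreed. A hypothetical sketch (the alarm codes, horizon, and column names are invented for illustration):

    # Hypothetical sketch of the synthetic target: a unit-day is positive only
    # when a critical alarm fires within the horizon AND a maintenance record
    # corroborates it. Alarm codes, horizon, and columns are invented.
    import pandas as pd

    HORIZON = pd.Timedelta("7D")         # assumed prediction horizon
    CRITICAL_ALARMS = {"E42", "E77"}     # hypothetical alarm codes

    def label_unit_days(days, alarms, work_orders):
        alarms = alarms[alarms["code"].isin(CRITICAL_ALARMS)]
        labels = []
        for _, row in days.iterrows():   # vectorize in production
            start, end = row["date"], row["date"] + HORIZON
            alarm_hit = alarms[(alarms["unit_id"] == row["unit_id"])
                               & alarms["ts"].between(start, end)]
            wo_hit = work_orders[(work_orders["unit_id"] == row["unit_id"])
                                 & work_orders["opened"].between(start, end)]
            # Positive only when sensor alarms and human reports agree.
            labels.append(int(len(alarm_hit) > 0 and len(wo_hit) > 0))
        return pd.Series(labels, index=days.index)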

Data Preparation

After assessing the data and defining predictable targets, we performed feature engineering on the data stream to create appropriate inputs for the time-series forecast problem. We varied the look-back and prediction-horizon windows, and carefully created training and validation data sets so as to avoid data leakage. We used bootstrap sampling techniques that upsampled rare events and downsampled common events, thereby increasing rare-event occurrence in the training data and enabling our models to better predict the target. We also downsampled the data set to speed up model development, so we didn’t have to wait hours for each training run.
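A minimal sketch of the windowing and resampling ideas (the look-back length, sampling ratios, and helper names are illustrative, not the project’s actual values):

    # Minimal sketch of look-back windowing and rare-event resampling; the
    # window length and sampling ratios are illustrative.
    import numpy as np

    def make_windows(series, labels, lookback=144):
        """Turn a (time, features) array into flat (samples, lookback*features)
        rows, each labeled by the target at the window's right edge."""
        X, y = [], []
        for t in range(lookback, len(series)):
            X.append(series[t - lookback:t].ravel())
            y.append(labels[t])
        return np.array(X), np.array(y)

    def rebalance(X, y, pos_mult=10, neg_frac=0.2, seed=0):
        """Upsample rare positives (with replacement) and downsample the
        common negatives, as in bootstrap resampling."""
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        pos_idx = rng.choice(pos, size=len(pos) * pos_mult, replace=True)
        neg_idx = rng.choice(neg, size=int(len(neg) * neg_frac), replace=False)
        idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))
        return X[idx], y[idx]

    # Leakage note: train/validation must be split chronologically *before*
    # windowing, so no look-back window straddles the boundary.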

Model Engineering

We followed our standard process to evaluate multiple model architectures, from logistic regression to tree-based ensemble techniques to neural networks. We used convolutional neural networks without feature engineering as a baseline for accuracy, but settled on two classes of tree-based models (random forests and gradient-boosted trees) because they demonstrated better performance, were easier to tune, and were more interpretable. We worked with the client’s engineers to define expert features (e.g., pressure and temperature ranges), to optimize model accuracy, and to interpret the output of the model (e.g., feature importances).
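The bake-off followed the usual scikit-learn pattern; the sketch below reproduces its shape on synthetic data (in the real project, the split was chronological and the inputs were the engineered sensor windows):

    # Sketch of the model bake-off with scikit-learn; the data here is
    # synthetic, and in the real project the split was chronological.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.95], random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                      random_state=0)

    candidates = {
        "logistic": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=300, n_jobs=-1),
        "gbt": GradientBoostingClassifier(),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")

    # Interpretability: rank features by importance in the winning tree model.
    feature_names = [f"f{i}" for i in range(X.shape[1])]
    importances = candidates["gbt"].feature_importances_
    for feat, imp in sorted(zip(feature_names, importances),
                            key=lambda p: -p[1])[:5]:
        print(feat, round(imp, 4))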

Business Integration

After optimizing the hyperparameters for a family of models, we created a prediction job that updated a database with daily predictions.
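The shape of such a job is simple; in the sketch below, SQLite stands in for the client’s database, and the table and column names are invented:

    # Hypothetical shape of the daily scoring job; SQLite stands in for the
    # client's database, and the table and column names are invented.
    import datetime as dt
    import sqlite3

    import pandas as pd

    def run_daily_predictions(model, features: pd.DataFrame, db_path: str):
        scores = model.predict_proba(features.drop(columns=["unit_id"]))[:, 1]
        out = pd.DataFrame({
            "unit_id": features["unit_id"],
            "score": scores,
            "scored_at": dt.datetime.now(dt.timezone.utc).isoformat(),
        })
        with sqlite3.connect(db_path) as conn:
            # Append today's scores so downstream tools can read the
            # latest prediction per unit.
            out.to_sql("daily_predictions", conn, if_exists="append", index=False)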

The next phase of work involved shortlisting units that were likely to fail, and reviewing the sensor data with our client’s engineers to understand their triage process. We used tree interpretation to explain which signals the AI system was picking up on for individual predictions.
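One open-source way to do this kind of tree interpretation is the treeinterpreter package, which decomposes a random forest’s prediction into per-feature contributions. A sketch, assuming a fitted model and shortlisted feature rows:

    # Sketch of per-prediction explanation with the open-source
    # treeinterpreter package; rf, X_at_risk, unit_ids, and feature_names
    # are assumed to exist (fitted random forest and shortlisted rows).
    from treeinterpreter import treeinterpreter as ti

    prediction, bias, contributions = ti.predict(rf, X_at_risk)

    # For each shortlisted unit, show the features that pushed its
    # failure score up the most (class 1 contributions).
    for i, unit in enumerate(unit_ids):
        top = sorted(zip(feature_names, contributions[i][:, 1]),
                     key=lambda p: -p[1])[:3]
        print(unit, top)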

We then worked with the client’s mechanics and engineers to prototype a workflow that would enable them to use the predictions in a meaningful way. By following an agile, iterative design process, the tool rapidly evolved from a spreadsheet to an integrated GUI.

By understanding the client’s existing workflow, we weighed the cost of investigating a false positive against the benefit of catching a failure before it occurs. We worked with the client to determine the acceptable TPR (true positive rate) and FPR (false positive rate), which set a threshold for reviewing a potential failure.
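Conceptually, this is a cost-weighted threshold search over the ROC curve. A sketch with illustrative (not actual) costs, assuming validation labels y_val and model scores val_scores:

    # Sketch of cost-weighted threshold selection; the investigation and
    # missed-failure costs are illustrative, and y_val / val_scores are the
    # validation labels and model scores.
    import numpy as np
    from sklearn.metrics import roc_curve

    COST_FP = 1.0     # assumed cost of investigating a false alarm
    COST_FN = 20.0    # assumed cost of a missed failure

    fpr, tpr, thresholds = roc_curve(y_val, val_scores)
    n_pos = y_val.sum()
    n_neg = len(y_val) - n_pos

    # Expected cost at each candidate threshold on the ROC curve.
    cost = COST_FP * fpr * n_neg + COST_FN * (1 - tpr) * n_pos
    best = int(np.argmin(cost))
    print(f"threshold={thresholds[best]:.3f}  "
          f"TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")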

Our prototype shortlisted the units that required attention from the engineers. Because certain unit types and regions were more prone to failure than others, and different units had differing acceptable failure rates, we created a custom rules-based algorithm that identified units of interest to the engineers. This algorithm also merged signal processing techniques with machine learning predictions to account for the changes in the baseline propensity of a unit to fail.
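A simplified sketch of how such a rules layer can sit on top of the daily ML scores (the smoothing window, baseline, and per-region cutoffs here are invented for illustration):

    # Illustrative sketch of a rules layer over the daily ML scores; the
    # smoothing window, baseline, and per-region cutoffs are invented.
    import pandas as pd

    REGION_CUTOFFS = {"gulf": 0.6, "plains": 0.8}   # hypothetical

    def shortlist(preds: pd.DataFrame) -> pd.DataFrame:
        """preds: one row per unit-day with unit_id, region, date, score."""
        preds = preds.sort_values(["unit_id", "date"]).copy()
        # Smooth daily scores to suppress single-day spikes (signal processing).
        preds["smoothed"] = (preds.groupby("unit_id")["score"]
                             .transform(lambda s: s.rolling(7, min_periods=1).median()))
        # Subtract each unit's long-run baseline so chronically noisy units
        # don't dominate the shortlist.
        preds["baseline"] = (preds.groupby("unit_id")["smoothed"]
                             .transform(lambda s: s.expanding().median()))
        preds["excess"] = preds["smoothed"] - preds["baseline"]
        cutoff = preds["region"].map(REGION_CUTOFFS).fillna(0.7)
        flagged = preds[preds["smoothed"] > cutoff]
        # Rank by excess over baseline so the sharpest deteriorations surface first.
        return flagged.sort_values("excess", ascending=False)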

Team Training

By working closely with the client’s software engineering team throughout the engagement, we built their capability to continue developing the prototype and to maintain the codebase.

Results

Our client now has a predictive maintenance prototype. The software engineering team has the tools and skills to develop the product further and deploy it to the field operations organization.