Your Project Needs a Data Readiness Audit
In the early phase of a project, we dive into the “Understand” step of our Lean AI framework. There are two main forms of understanding we aim for—business understanding and data understanding.
In our work, we have found our “AI Uncertainty Principle” to be a useful heuristic: the value you get out of an AI project is bounded by the value of the business problem, multiplied by the data quality, multiplied by the predictive signal in the data. Because these factors multiply, if any one of them is too low the whole product collapses and the project won't be successful. We have no control over the last factor, predictive signal, but we do have control over the first two. That's why it's critically important to de-risk these factors early with good data auditing.
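To make the heuristic concrete, here is a minimal sketch. The function name and the [0, 1] normalization of each factor are our illustrative choices, not a formal model; the point is simply that one weak factor caps the entire product:

```python
def project_value(business_value: float, data_quality: float, predictive_signal: float) -> float:
    """Heuristic upper bound on AI project value: the product of the
    three factors, each expressed as a fraction in [0, 1]."""
    return business_value * data_quality * predictive_signal

# Three strong factors yield a healthy bound...
print(project_value(0.9, 0.9, 0.9))  # roughly 0.73
# ...but one weak factor sinks the product, no matter how strong the others are.
print(project_value(0.9, 0.1, 0.9))  # roughly 0.08
```

This is why auditing data quality early pays off: it is one of the two factors in the product you can actually move.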
Before starting big projects, we strongly recommend protecting your organization from costly missteps by performing some form of AI Data Readiness audit. This audit should look at:
- identifying the data you have available, including assessing quality and quantity
- integrating the data you have
- addressing data engineering needs, including building pipelines and monitoring the system
At Manifold, we have created a 15-page audit document incorporating decades of collective experience. We review this document with clients in the early phase of an engagement to ensure we have a good handle on data quality and risks.
Here is a small excerpt from the audit, about examining the availability of your data:
For each data source, ask:
- Is the dataset, or certain attributes of it, covered by corporate policy or regulations, e.g., protected health information (PHI) or personally identifiable information (PII)?
  - Regulations vary by level of jurisdiction (state, national, and international) and are generally based on both the content of the data and the location of the individuals it covers. Some types of data and their associated regulations may limit even your ability to evaluate the data without engaging in specialized processes (e.g., getting approval from the compliance, privacy, and security teams). Examples of these regulations include the GDPR and the California Consumer Privacy Act (CCPA).
- If the data is regulated, could the provider of the data modify it in a way that still supports analysis? Examples include:
  - Consistently hashing the values of a particular field (e.g., SSN)
  - Changing attribute values while maintaining the overall properties of the dataset (e.g., shifting a patient's visit dates by +/- 5 days while preserving the sequencing of all of that patient's visits)
- De-identifying or anonymizing data may also be possible. However, certain types of data have well-defined standards for de-identification, while other types may require specialized knowledge and validation.
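The two example modifications above can be sketched in a few lines of Python. This is an illustrative sketch only, not a compliance-grade implementation: `hash_field` and `shift_visit_dates` are hypothetical helpers we name here for the example, and any real de-identification scheme should be validated with your privacy and security teams.

```python
import hashlib
import random
from datetime import date, timedelta

def hash_field(value: str, salt: str = "per-project-secret") -> str:
    """Consistently hash a sensitive field (e.g., SSN): identical inputs
    yield identical digests, so joins across tables still work, but the
    raw value is no longer present in the dataset."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def shift_visit_dates(visits: list[date], max_days: int = 5, seed: int = 0) -> list[date]:
    """Shift all of one patient's visits by a single random offset of up
    to +/- max_days. Using one constant offset per patient preserves both
    the order and the spacing of that patient's visits."""
    offset = timedelta(days=random.Random(seed).randint(-max_days, max_days))
    return [visit + offset for visit in visits]

# The same SSN always hashes to the same value, so records still link up...
assert hash_field("123-45-6789") == hash_field("123-45-6789")
# ...and shifted visits keep their original sequencing.
visits = [date(2023, 1, 1), date(2023, 1, 4), date(2023, 2, 1)]
assert sorted(shift_visit_dates(visits)) == shift_visit_dates(visits)
```

Note that a per-patient constant offset is only one way to satisfy the "+/- 5 days while maintaining sequencing" requirement; whichever scheme you choose, verify it against the relevant regulation before sharing the data.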
Interested in learning more? Get in touch.
Prior to Manifold, Vinay was co-founder and CTO at Kyruus, a venture-backed software company that offers a data-driven platform to better match patients and physicians. He has also been a product manager at KAYAK, where he worked with both Hadoop and Hive to develop a robust view of customers as well as a predictive model for flight pricing. He is a co-inventor on several granted patents for search and faceted navigation.