The privacy-cyber world seems preoccupied with issues at the nexus of personal data and AI. Those issues, although important, are dwarfed by a more pressing and fundamental question: can we get AI to do useful things reliably and accurately when predicting significant human outcomes, such as health, criminal propensity, credit risk, and so on (“Predictive AI”)? Arvind Narayanan and Sayash Kapoor, two luminaries in the AI field from Princeton University, suggest the answer is “No,” and they make their case in AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference. Although we very much recommend the book (it is excellent), we think the thesis is too pessimistic. Companies should not “throw the baby out with the bathwater” but should instead distill the precepts that allow for the development of more rigorous Predictive AI that avoids known pitfalls.

The crisis in predictive AI 

Whether one looks at commercial deployments of Predictive AI (which AI Snake Oil takes apart at the seams, example by example) or at the use of machine-learning-based predictions in science, one is confronted by more failure than success. In the scientific realm, Narayanan and Kapoor, from just a limited sampling, identified “41 papers from 30 fields where [machine-learning] errors have been found, collectively affecting 648 papers and in some cases leading to wildly overoptimistic conclusions” as part of their Leakage and the Reproducibility Crisis in ML-based Science project. The crisis, in their words, “highlights the immaturity of ML-based science, the critical need for ongoing work on methods and best practices, and the importance of treating the results from this body of work with caution.” Id. (italics added). They also said the crisis will continue as the “expected state of affairs until best practices become better established and understood.” Ouch.

When two noted experts call out ML-based science as immature and lacking in best practices, and then also question whether commercial applications of Predictive AI can even work, how are companies going to mount defenses against class action lawsuits, respond to regulatory investigations, or complete risk assessments? The answer, of course, is that companies should tread carefully. Any company that deploys Predictive AI would be well-advised to follow a documented, well-reasoned approach to the development and validation of its model(s) and to adopt best practices that avoid the known errors of others.

One of the core problems identified by Narayanan and Kapoor in their recent computer science publications is “Data Leakage.” In predictive machine learning, data typically consists of “features” (independent variables) and “labels” (dependent variables, i.e., the thing being predicted). A core tenet of building predictive machine learning models is that the data must be split into “training data,” used to train the model, and “test data,” used to evaluate the effectiveness of the trained model. The reason for splitting the data in this fashion is to create an objective basis for judging the model’s effectiveness. For the split to work, however, the test data must remain completely separate from the training process; if information leaks across that boundary, the model will appear more effective than it really is, often by a large margin. This phenomenon, called “Data Leakage,” can happen in extremely subtle and unexpected ways, and it is one of the primary ways that Predictive AI can end up not being very predictive. Narayanan and Kapoor provide a detailed taxonomy of Data Leakage failure modes in their paper.
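
To make the core tenet concrete, here is a minimal sketch of the split-then-evaluate workflow in Python using scikit-learn; the dataset is purely synthetic and the numbers are illustrative only, not drawn from the book.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # "features" (independent variables)
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # "labels" (the thing being predicted)

# Hold out test data BEFORE any training, so evaluation stays objective.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# The model is judged only on data it never saw during training.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```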

When Data Leakage occurs, it can lead a company or researcher to believe that a model is highly predictive when, in fact, it is not. In a worst-case scenario, a company makes claims about what its model can do, and those claims turn out to be entirely false, exposing the company to a wide spectrum of risk. Given how easy it is to ruin a machine learning model with Data Leakage, we briefly discuss each of the Data Leakage categories identified by Narayanan and Kapoor. In each case, we also recast the category as a positive rule, and where a short code sketch helps make a rule concrete, we include one after the rule.

No test data—This is the classic, epic fail. When there is no data split and the model is evaluated on the very data used to train it, there is no objective basis upon which to assess the model’s effectiveness.

  • Positive rule: Split data between training and test sets using best practices.
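
As a rough illustration of why the rule matters (a sketch assuming Python with scikit-learn and purely synthetic, unpredictable data), a flexible model scored on its own training data can look nearly perfect even when there is nothing real to predict:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = rng.integers(0, 2, size=500)   # pure noise: there is nothing real to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("score on its own training data:", model.score(X_train, y_train))  # near 1.0 -- meaningless
print("score on held-out test data:", model.score(X_test, y_test))       # near 0.5 -- the reality
```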

Sneaky duplicates in the data—This occurs when some of the same individuals, or their features, appear in both the training data and the test data. It can be harder to spot than it sounds. In complex data sets, for example, the same individual may appear in both sets as distinct records, with features and labels measured at different points in time or recorded as separate instances.

  • Positive rule: Individuals represented in the training set must not be represented, in any way, in the test set.
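
One common way to enforce this rule, sketched below under the assumption that records carry a hypothetical person_id identifier, is to split by individual rather than by row (here using scikit-learn’s GroupShuffleSplit):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical data: three records per person (e.g., measurements taken at different times).
df = pd.DataFrame({
    "person_id": np.repeat(np.arange(200), 3),
    "feature_1": rng.normal(size=600),
    "label":     rng.integers(0, 2, size=600),
})

# Split by person, so no individual contributes records to both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["person_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: the two sets share no individuals.
assert set(train["person_id"]).isdisjoint(set(test["person_id"]))
```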

Preprocessing of training data and test data is done together—Preprocessing can include various transformations (scaling, imputation, and the like) that make the data more digestible for training or prediction. However, fitting those transformations on the combined data, before the split between training and test sets, means that statistics and properties of the test data are baked into the preprocessing used to train the model; in other words, information about the test set is “leaked” into the training process.

  • Positive rule: Preprocessing must be fit on the training data only, after the data set is split between training data and test data. The fitted preprocessing steps can then be applied to the test data, but no aspect, statistic, or property of the test data can be used in fitting them.
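
A minimal sketch of leak-free preprocessing, assuming Python with scikit-learn: the scaler is fit on the training data only, and the training-fit scaler is then applied to the test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(1000, 4))
y = (X[:, 0] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler on X_train only; the test data is later transformed
# with the training-set mean and standard deviation, never refit on the test set.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# The anti-pattern to avoid: StandardScaler().fit(X) on the full data before the
# split, which bakes test-set statistics into the training process.
```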

Feature selection is done on training data and test data together—When the decision about which features should be used to train a model is made before the split between training and test data, one is, in a roundabout way, peeking at the test data to determine which features are most predictive. Once again, the boundary between training and test data has been breached.

  • Positive rule: Feature selection must occur after training data and test data are split, and without resort or reference to the test data.
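
One way to keep the selection decision inside the training data, sketched here with scikit-learn (the synthetic data and the choice of k are illustrative assumptions), is to make feature selection a step in the modeling pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 50))                                        # many candidate features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# SelectKBest is fit as part of the pipeline, so the "which features matter?"
# decision is made from the training data alone.
model = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```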

The training data uses features that are a “proxy” for the outcome—If a data set is used to train a model to predict who has high cholesterol, and one of the features is whether the person takes a statin drug, then information about the outcome (whether the person has high cholesterol) is already implicit in one of the features.

  • Positive rule: Ensure that no feature is simply a proxy for the outcome variable.
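
No mechanical test catches every proxy, but a simple screen can flag features whose association with the label is suspiciously strong so a human can ask whether they are just the outcome in disguise. The sketch below uses pandas on a hypothetical cholesterol data set; the column names (including takes_statin) and the 0.8 threshold are illustrative assumptions, not a standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
high_cholesterol = rng.integers(0, 2, size=n)                    # the outcome/label

df = pd.DataFrame({
    "age":              rng.integers(30, 80, size=n),
    "bmi":              rng.normal(27, 4, size=n),
    "takes_statin":     (high_cholesterol & (rng.random(n) < 0.9)).astype(int),  # near-proxy
    "high_cholesterol": high_cholesterol,
})

# Flag features that track the label almost perfectly, for human review.
corr = df.drop(columns="high_cholesterol").corrwith(df["high_cholesterol"]).abs()
print(corr.sort_values(ascending=False))
print("possible proxies:", list(corr[corr > 0.8].index))
```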

The training data are taken from a different population than the test data—If the training data and test data come from entirely separate populations (e.g., the training data consisted only of men and the test data only of women), there is no sound basis on which to assess the accuracy of the trained model.

  • Positive rule: The training and test data must be drawn from the same population, the population about which the Predictive AI claims are being made.
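
A quick sanity check, sketched below with pandas and a hypothetical sex column, is to compare the make-up of the training and test sets before relying on any performance numbers:

```python
import pandas as pd

# Hypothetical split with a demographic attribute relevant to the claims being made.
train = pd.DataFrame({"sex": ["M"] * 700 + ["F"] * 50})
test = pd.DataFrame({"sex": ["F"] * 300})

print("training set:\n", train["sex"].value_counts(normalize=True))
print("test set:\n", test["sex"].value_counts(normalize=True))
# A gap this large means the test results say little about how the model will
# perform on the population the Predictive AI claims to serve.
```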

Temporal leakage—If the model aims to predict future outcomes, then the test data cannot be drawn from a period earlier than the training data; otherwise the model is, in effect, using information from the future to predict the past. This is another form of leakage.

  • Positive rule: Where time is part of the prediction task, temporal boundaries must be respected in the selection of training and test data: train on the earlier period, test on the later one.
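
A minimal sketch of a time-respecting split, assuming pandas and a hypothetical record_date column: records before a cutoff date train the model, and records after it test the model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "record_date": pd.date_range("2020-01-01", periods=1000, freq="D"),
    "feature_1":   rng.normal(size=1000),
    "label":       rng.integers(0, 2, size=1000),
})

cutoff = pd.Timestamp("2022-01-01")
train = df[df["record_date"] < cutoff]    # the past: used to train
test = df[df["record_date"] >= cutoff]    # the future: used to test

# No training record postdates any record the model is tested on.
assert train["record_date"].max() < test["record_date"].min()
```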

An extended explanation of these issues can be found in Narayanan and Kapoor’s REFORMS: Consensus-Based Recommendations for Machine-learning-based Science article.

Final thoughts

There are many ways to get machine learning and Predictive AI wrong, and Data Leakage is just one of them, albeit an important one. Others include trying to use ML/AI in a context with too much inherent randomness to ever support successful predictions, starting from an erroneous problem specification, or building on biased samples. None of these issues relates to “personal information” or “privacy” per se, yet each of them can cause a model to fail in its essential purpose and expose companies to significant litigation and regulatory risk.