The Mystery of Ensembles in Data Analysis
What is the role of ensembles when it comes to data analysis? Steven Struhl, author of Artificial Intelligence Marketing and Predicting Consumer Choice, explains.
Ensembles may seem like something of a mystery to many readers. Some might hazard that they have something to do with music, but otherwise they remain an enigma.
We can at least clear up the name. In data analysis, the term refers to a family of methods that combine the estimates of many models, typically by averaging or voting.
However, what these methods actually do remains largely hidden.
All ensemble methods capitalize on one of the key discoveries from machine learning: the average of many mediocre or weak models is typically better than any one of those models, provided the models do not all make the same mistakes. Ensemble methods usually run dozens or hundreds of models and get a consensus from them.
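A little arithmetic shows why a consensus can beat its members. The sketch below assumes something real models rarely satisfy exactly: every model is right with the same probability and the models err independently of one another. The function name and the 60 per cent figure are illustrative only.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Chance that a strict majority of n_models is right.

    Assumption: each model is right with probability p_correct,
    independently of all the others.
    """
    needed = n_models // 2 + 1  # smallest number of votes that is a majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Weak models that are right only 60% of the time, polled in growing numbers.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the consensus improves as models are added. The same arithmetic cuts the other way: if each model is right less than half the time, polling more of them makes the consensus worse. In practice models share data and therefore share errors, so the gain is usually smaller than this idealized calculation suggests.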
Ensembles share one disconcerting property: while they may perform amazing feats, they typically tell us little or nothing intelligible about what they are actually doing. We must take the output and trust that the means to reach it are sound. This trust can be validated in operations where a machine must navigate the real world.
The preeminent example of this is a self-driving car. Either it gets you there safely, or it gets lost, or it stops entirely, or it crashes. Depending on its behaviour, we know whether all its estimates together are working.
When we are navigating data, however, results typically are more ambiguous.
For instance, suppose you devise a fiendishly clever recommendation engine (these are the algorithms that suggest what you might buy on a site like Amazon or view on a service like Netflix). Since most people do not rush to buy or use all the recommendations, how exactly do you judge how well you are doing vs. some theoretically best level of performance?
You could run a contest between the recommendation engine, random picks, just putting up the most popular items, and choices by an expert (an actual human being). However, even if your engine won, this would still not tell you how far you were from some theoretical optimum.
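A contest of this kind can be sketched as a hit-rate comparison. Everything below is hypothetical: the shoppers are synthetic, purchases are drawn from an invented popularity curve, and only two of the four contestants (random picks and most popular items) are included. A real engine or an expert's list would simply be another function passed to hit_rate.

```python
import random
from collections import Counter

rng = random.Random(7)
items = list(range(50))
# Invented popularity curve: item 0 sells most, item 49 least.
popularity = [1 / (rank + 1) for rank in items]

# Each synthetic shopper has five past purchases and one later purchase
# that is hidden from the recommenders and used only for scoring.
shoppers = [
    (rng.choices(items, popularity, k=5), rng.choices(items, popularity, k=1)[0])
    for _ in range(2000)
]

# The best sellers are counted from purchase histories only.
purchase_counts = Counter(item for history, _ in shoppers for item in history)
top_sellers = [item for item, _ in purchase_counts.most_common(5)]

def random_picks(history):
    return rng.sample(items, 5)

def most_popular(history):
    return top_sellers

def hit_rate(recommend):
    """Share of shoppers whose hidden purchase appears among the five suggestions."""
    hits = sum(held_out in recommend(history) for history, held_out in shoppers)
    return hits / len(shoppers)

scores = {
    "random picks": hit_rate(random_picks),
    "most popular": hit_rate(most_popular),
}
print(scores)
```

The scores rank the contestants against each other and nothing more. No number in the output says what the best achievable hit rate would be.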
So far, in many real-world applications, it seems impossible to know the size of the gap between performance and some absolute best. Some comparisons have been made among the three methods other than random picks, and so far there is no clear winner.
When we are trying to predict a quantity, such as household purchases of boxes of SoggyOs (everybody’s favourite breakfast-like substance), we can do some testing using hold-out samples or k-fold cross-validation. A hold-out test takes some of the data set, puts it to one side (or holds it out), builds the model on the rest of the data, and finally tests the model on the set that was held aside. Cross-validation repeats this several times, holding out a different portion of the data on each pass, and averages the results.
The idea is that the hold-out sample will not have all the peculiar features in the portion of the data used to make the model. Predicting little bumps and oddities peculiar to one data set is called over-fitting the data. The thinking goes that testing in this way will give some idea of how the model will work in the outside world.
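The hold-out idea can be sketched in a few lines. The data here are synthetic (an invented link between household size and boxes bought), the model is a plain straight-line fit, and all function names are our own; the point is only the mechanics of holding each row out exactly once.

```python
import random
from statistics import mean

def k_fold_splits(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def cross_validated_error(xs, ys, k=5):
    """Mean absolute error on held-out rows, averaged over the k folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        fold_errors.append(
            mean(abs(ys[i] - (slope * xs[i] + intercept)) for i in test_idx)
        )
    return mean(fold_errors)

# Synthetic data: boxes bought rises with household size, plus noise.
rng = random.Random(42)
household_size = [rng.uniform(1, 6) for _ in range(200)]
boxes_bought = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in household_size]

cv_error = cross_validated_error(household_size, boxes_bought)
print(round(cv_error, 3))
```

Because each fold's error is measured on rows the model never saw while it was being fitted, the average is a fairer guess at performance on new data than the error on the fitting data would be.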
We discussed several ensemble methods in Artificial Intelligence Marketing and Predicting Consumer Choice. One, with the intriguing name Decorate, did extremely well in predicting an outcome in a cross-validated model test. In fact, it did the best of all the methods we tested.
However, we had absolutely no idea what it actually did except in a broad outline.
Basically, this method works in many steps. First, it runs a model of some type and gets predictions; then it examines the results and adds another smaller set of artificial data to the basic data set. This second set contradicts the first set in some instances. This set of original and contradictory data requires some adjustment of the first model to make a new best prediction. Then the procedure adds more artificial data, and runs again, and again. Finally, it reaches some pre-specified stopping point.
All this extra diversity in the combined actual and artificial data sets does in fact produce strong predictive models. However, all you can do with the result is save the set of steps it took and apply them to another data set.
There is no model that you can examine and check against common sense, as you have with other methods, such as Bayesian networks or classification trees.
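The broad outline of Decorate given above can be made concrete with a heavily simplified sketch. It is not the published algorithm or any library's implementation. The assumptions are ours: two classes only, one-split "stumps" as the base model, artificial rows drawn from a normal curve fitted to each column, artificial labels set to the exact opposite of the current consensus, and a new model kept only if the consensus error on the real data does not rise.

```python
import random
from statistics import mean, pstdev

def fit_stump(rows, labels):
    """Best one-split rule (feature, threshold, label_above) by training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):
                errs = sum(
                    (above if r[f] > t else 1 - above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errs < best[0]:
                    best = (errs, f, t, above)
    return best[1:]

def vote(ensemble, row):
    """Majority vote of the stumps; ties go to class 0."""
    ones = sum(above if row[f] > t else 1 - above for f, t, above in ensemble)
    return 1 if 2 * ones > len(ensemble) else 0

def error_rate(ensemble, rows, labels):
    return sum(vote(ensemble, r) != y for r, y in zip(rows, labels)) / len(rows)

def decorate_sketch(rows, labels, max_size=10, max_tries=30, art_fraction=0.5, seed=0):
    rng = random.Random(seed)
    n_feat = len(rows[0])
    mus = [mean(r[f] for r in rows) for f in range(n_feat)]
    sds = [pstdev([r[f] for r in rows]) for f in range(n_feat)]
    ensemble = [fit_stump(rows, labels)]           # step 1: a first model
    err = error_rate(ensemble, rows, labels)
    tries = 0
    while len(ensemble) < max_size and tries < max_tries:
        tries += 1
        # Artificial rows that resemble the real data, column by column.
        art_rows = [
            [rng.gauss(mus[f], sds[f]) for f in range(n_feat)]
            for _ in range(int(art_fraction * len(rows)))
        ]
        # Label them to contradict what the ensemble currently predicts.
        art_labels = [1 - vote(ensemble, r) for r in art_rows]
        candidate = fit_stump(rows + art_rows, labels + art_labels)
        new_err = error_rate(ensemble + [candidate], rows, labels)
        if new_err <= err:                         # keep only if no worse on real data
            ensemble.append(candidate)
            err = new_err
    return ensemble, err

# Synthetic two-column data with a diagonal boundary no single stump can match.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(80)]
labels = [1 if x + y > 1 else 0 for x, y in rows]

ensemble, final_error = decorate_sketch(rows, labels)
first_error = error_rate(ensemble[:1], rows, labels)
print(len(ensemble), first_error, final_error)
```

What comes out is a list of split rules and a voting procedure. Each rule is readable on its own, but the list as a whole offers no single picture to check against common sense.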
We also discussed the random forests method, which can give a very clear reading of how strongly variables affect a target or outcome. This runs hundreds of classification trees, randomly swapping predictors and people into and out of the model. Then all the trees get to vote on the final outcome. All this sampling and swapping allows us to get both an overall correct prediction score and an accurate fix on variables’ effects. However, we could not look across the several hundred trees involved and see anything that we would even faintly understand.
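The sampling, swapping and voting can be illustrated with a toy forest. This is a sketch under loud simplifications, not a faithful random forest: each "tree" is a single split, predictors are drawn once per tree rather than at every split, the data are synthetic (only the first column matters), and accuracy is measured on the same rows used for fitting. Variable effects are estimated by shuffling one column at a time and seeing how much accuracy is lost, a technique usually called permutation importance.

```python
import random

def fit_stump(rows, labels, features):
    """Best one-split rule (feature, threshold, label_above) among the given features."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):
                errs = sum(
                    (above if r[f] > t else 1 - above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errs < best[0]:
                    best = (errs, f, t, above)
    return best[1:]

def grow_forest(rows, labels, n_trees=101, features_per_tree=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Swap people in and out: resample rows with replacement.
        sample = [rng.randrange(len(rows)) for _ in rows]
        # Swap predictors in and out: each tree sees a random subset of columns.
        feats = rng.sample(range(len(rows[0])), features_per_tree)
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feats)
        )
    return forest

def predict(forest, row):
    """Every tree votes; the majority wins."""
    ones = sum(above if row[f] > t else 1 - above for f, t, above in forest)
    return 1 if 2 * ones > len(forest) else 0

def accuracy(forest, rows, labels):
    return sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(forest, rows, labels, feature, seed=0):
    """Accuracy lost when one column is shuffled, breaking its link to the outcome."""
    column = [r[feature] for r in rows]
    random.Random(seed).shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return accuracy(forest, rows, labels) - accuracy(forest, shuffled, labels)

# Synthetic data: only column 0 drives the outcome; columns 1 and 2 are noise.
data_rng = random.Random(3)
rows = [[data_rng.random() for _ in range(3)] for _ in range(100)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

forest = grow_forest(rows, labels)
overall = accuracy(forest, rows, labels)
importances = [permutation_importance(forest, rows, labels, f) for f in range(3)]
print(overall, importances)
```

The importance scores single out the column that matters, yet the forest itself is just a long list of split rules. Reading them one by one explains very little about how the consensus is reached.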
Whether you use ensembles or not depends on your goals. If you want to squeeze out the last possible drop of predictive accuracy, and are not concerned with what is inside the model, they may be the way to go.
However, if you want to be able to check the model vs. what you or various experts know or believe to be so, you would want to turn to more visually transparent methods. These include any of the many regression-based methods, or the classification trees and Bayesian networks that we mentioned earlier.