Literature Review: Statistical Modeling - The Two Cultures

Discussing my thoughts on various academic papers.

Seouk Jun Kim
4 min read · May 9, 2020

Although it was written in 2001, Breiman’s Statistical Modeling: The Two Cultures conveys ideas that are more readily accepted by today’s data science community. Algorithmic models are now used everywhere: think of the deep learning boom. However, it is precisely this prevalence of algorithmic models that blurs the differences between the many kinds of models and blinds us to the actual reasons for selecting an algorithmic model.

Breiman’s paper was enlightening for me, especially because I had been confused about how to choose the right approach for different problems. Having had no formal coursework in statistics or data science, I lacked any confirmation that the way I was approaching prediction projects was sound. While Breiman’s paper does not directly address those issues, the descriptive manner in which he lays out his arguments gives considerable insight.

Summary

Breiman first distinguishes between two types of models: the data model and the algorithmic model.

The main characteristics of each can be described as follows:

  1. Data Model
  • Better interpretability
  • Used by the majority of statisticians (98% as of 2001)
  • Simplified representation of nature’s algorithm
  2. Algorithmic Model
  • Lack of interpretability
  • Used by a small minority of statisticians (2% as of 2001)
  • More complex representation of nature’s algorithm

Breiman goes on to argue that while data models do give us a clearer picture of what goes on during the prediction process, they can also misrepresent nature’s mechanism. For example, he points to the feature extraction performed in the linear and logistic regression models of various projects: different projects applying feature extraction to the same kind of problem often come up with different answers as to which features impact the outcome the most. Furthermore, different models with nearly equal prediction accuracy can have vastly different model structures. He calls this the multiplicity of data models.
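To get a concrete feel for this multiplicity, here is a minimal sketch (my own illustration, not an example from the paper): the same linear model class is refit on bootstrap resamples of one synthetic dataset with correlated predictors, and the “most important” features can shuffle around even though the fits score almost identically.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample

# Synthetic data with correlated predictors, so several different
# linear fits can explain the response almost equally well.
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       effective_rank=4, noise=10.0, random_state=0)

for seed in range(3):
    # Refit the same model class on a bootstrap resample of the data.
    Xb, yb = resample(X, y, random_state=seed)
    model = LinearRegression().fit(Xb, yb)
    top = np.argsort(-np.abs(model.coef_))[:3]
    print(f"resample {seed}: R^2 = {model.score(X, y):.3f}, "
          f"top features by |coef|: {top}")

# Near-identical R^2, yet the top-ranked features can differ from
# resample to resample -- Breiman's multiplicity of data models in miniature.
```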

If data models are giving us a wrong representation of nature’s mechanisms, then it does not matter that they present a clearer picture of what they are doing; it is just going to be a clear, but wrong representation.

The most important metric to consider, Breiman argues, is prediction accuracy. It is through high prediction accuracy that we can confirm how well a model represents nature’s black box. And in terms of producing the best prediction accuracy, algorithmic models such as random forests excel. The multiplicity of models still exists, since algorithmic models also take different shapes despite having similar prediction accuracies, but their feature extraction is more consistent and meaningful. Furthermore, algorithmic models sometimes benefit from high dimensionality, which is exactly what most data models have sought to avoid.
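As a rough illustration of putting prediction accuracy first, the sketch below (the dataset and settings are my own choices, not Breiman’s) compares the cross-validated accuracy of a logistic regression data model against a random forest. On a small, well-behaved dataset like this the two can come out close; the gap Breiman describes tends to show up on messier, higher-dimensional problems.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic regression (data model)":
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest (algorithmic model)":
        RandomForestClassifier(n_estimators=300, random_state=0),
}

# Judge both cultures by the same yardstick: held-out prediction accuracy.
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```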

The conclusion Breiman draws is that if we are to use modeling to inform policy decisions, we should be aware that algorithmic models tend to perform better than data models at mimicking nature. This certainly does not mean that data models are inferior, but when we cannot be certain that a data model frames the process correctly, we should be ready to admit the complexity of the process and consider using an algorithmic model. In short, Breiman’s main purpose with this paper was to open the statistics community’s eyes to the possibilities of algorithmic models.

Impressions

First off, prior to reading the paper I did not make a clear distinction between data models and algorithmic models; they all fell under the same umbrella of machine learning. And when you think about it, there is not much difference between the two in essence. Logistic regression and neural networks, for example, have very similar structures, with the neural network (NN) mainly adding more complexity. The reason for differentiating between the two kinds of models is the recognition that complex problems call for complex solutions, even if we never get to understand those solutions entirely.
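Here is a tiny NumPy sketch of that structural similarity (the numbers are purely illustrative): logistic regression computes a sigmoid of a weighted sum, which is exactly what a single neuron does; a neural network just composes more of these layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])      # one input example
w = np.array([0.8, 0.1, -0.4])      # learned weights
b = 0.2                             # intercept / bias

# Logistic regression: P(y=1 | x) = sigmoid(w . x + b)
p_logreg = sigmoid(w @ x + b)

# Single-hidden-layer neural network: the same weighted-sum-plus-nonlinearity,
# composed twice (random weights here, standing in for trained ones).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), rng.normal()
p_nn = sigmoid(w2 @ sigmoid(W1 @ x + b1) + b2)

print(p_logreg, p_nn)
```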

It actually makes a lot of sense. Take image classification, for example. A huge number of features come into play when a machine recognizes a cat, and the model for it is bound to be immensely complex. Trying to come up with a slim, refined model to substitute for a complex one that nevertheless gets the job done only complicates a problem that is already complex.

There are other takeaways from this paper.

First, a feature’s learned weight does not directly correlate with its importance. What matters is how much prediction accuracy changes when that feature is added or removed. This can be seen in example 1 of section 11.1, where Breiman takes exactly this kind of approach to identify the most important features correctly.
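A hedged sketch of the difference (this uses drop-one-feature accuracy as a simplified stand-in for Breiman’s importance measure, on a dataset of my own choosing): rank features by coefficient magnitude, then check how much cross-validated accuracy actually suffers when each one is removed.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

baseline = cross_val_score(model, X, y, cv=5).mean()
coefs = np.abs(model.fit(X, y).named_steps["logisticregression"].coef_.ravel())

for j in np.argsort(-coefs)[:5]:
    # Remove feature j and see how much held-out accuracy actually changes.
    acc = cross_val_score(model, np.delete(X, j, axis=1), y, cv=5).mean()
    print(f"feature {j}: |coef| = {coefs[j]:.2f}, "
          f"accuracy change when removed = {acc - baseline:+.4f}")

# A large weight does not guarantee a large accuracy hit when the feature goes away.
```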

Second, a complex but accurate model can provide more insight during interpretation. If a model is inaccurate, biased, overfitted, and so on, then the information gathered from it is flawed and cannot help much in interpreting the problem.

Originally published at http://armystudies.wordpress.com on May 9, 2020.
Edited from original content on Medium.


Seouk Jun Kim

Columbia University student, CS. Former iOS developer, presently pursuing career in data science. https://www.linkedin.com/in/seouk-jun-kim-a74921184/