I agree that most of the time the problem is that your model is bad and you should feel bad, but I think 'overfitting' is still a useful concept to have in your mind as an explanation for what just happened when your kitchen-sink ML model fits absolutely beautifully in training and then just doesn't on the validation/test data. The moral high ground says not to train models by throwing the kitchen sink into the neural net blender with no thought about what features you think ought to fit, but if you had stuck to that you'd have missed most of the huge gains in AI over the last decade or two. We are still struggling to make any sense whatsoever of what features LLMs have ended up learning about language (and maybe indirectly the world?!), but they sure can generate very plausible text.
I think I agree with Ben that in that situation you should blame the data. Like voters, data is often a bunch of bastards with nothing realistic to say about the future.
Actual lol there. Very true. Although I would hesitate to describe it like that to principals. When you have a complete set of actual data points you are in the same bind as you are with voters. We might not like the fact that they have nothing coherent to say to us in aggregate, but you can't just throw them out and get better ones.
At least with voters it is legitimate to attempt to shape their views to conform to your own expectations, although as a data professional I am dismayed when such attempts are cloaked as neutral efforts to merely measure them.
Also as a data professional, I would of course never dream of attempting to shape the data to conform to my expectations or those of my clients. We have LLMs to do that for you now anyway.
Somewhat related: throwing the kitchen sink at prediction problems is often remarkably effective.
https://papers.nips.cc/paper_files/paper/2008/hash/0efe32849d230d7f53049ddc4a4b0c60-Abstract.html
I think there's maybe another ML phenomenon that falls under the "overfitting" umbrella that's also potentially real enough to be worth naming, and that's "the point at which the accuracy on the test data starts to get worse even as the accuracy on the training data keeps improving".
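To make that divergence concrete, here's a minimal sketch using scikit-learn's gradient boosting on a noisy synthetic dataset (the dataset and settings are just toy choices of mine, not anything from the post): training accuracy keeps climbing stage by stage while held-out accuracy tends to peak and then slide back.

```python
# Toy illustration of train/test divergence as training continues.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification problem with deliberately noisy labels (flip_y).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=500, max_depth=4,
                                   learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)

# Accuracy after each boosting stage, on both splits.
train_acc = [np.mean(p == y_tr) for p in model.staged_predict(X_tr)]
test_acc = [np.mean(p == y_te) for p in model.staged_predict(X_te)]

# Training accuracy heads towards 1.0; test accuracy usually peaks early
# and then declines -- the point worth naming.
best_stage = int(np.argmax(test_acc))
print(f"test accuracy peaks at stage {best_stage} ({test_acc[best_stage]:.3f}),"
      f" ends at {test_acc[-1]:.3f};"
      f" final training accuracy {train_acc[-1]:.3f}")
```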
Isn't the absence of explicit theory in machine learning (roughly, discriminant analysis on steroids) central to the problem? There's always an implicit theory, but if it is just "all these variables must be related in some way", you're committed to overfitting from the start.
Two cures for the problem: fractional polynomials as discussed by Royston, and abductive data exploration. FPs are a good compromise between non-parametric overfitting and the limited flexibility of standard parametric models, which do not do very well at capturing weird stuff in the data. Abductive reasoning allows for searching for the best plausible explanation, one that could enrich theory, rather than twisting yourself into a pretzel to explain weird data so that it fits existing theories. Of course, to economists, abductive reasoning probably sounds like alchemy ;-)
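For anyone to whom FPs sound exotic: a first-degree fractional polynomial just picks one power transform of the predictor from a small conventional set, with power 0 read as log. A rough back-of-the-envelope sketch of that selection step (my own toy illustration, not Royston's mfp implementation):

```python
# Sketch of a first-degree fractional polynomial (FP1) fit: try each power
# in the conventional set, with 0 meaning log(x), and keep the transform
# giving the smallest residual sum of squares. Predictor must be positive.
import numpy as np

POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]  # conventional FP power set

def fp_transform(x, p):
    return np.log(x) if p == 0 else x ** p

def fit_fp1(x, y):
    best = None
    for p in POWERS:
        X = np.column_stack([np.ones_like(x), fp_transform(x, p)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        if best is None or rss < best[0]:
            best = (rss, p, beta)
    return best  # (residual sum of squares, chosen power, coefficients)

# Toy data generated from a log relationship; FP1 should recover power 0.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 10, size=300)
y = 2.0 + 1.5 * np.log(x) + rng.normal(scale=0.3, size=300)
rss, p, beta = fit_fp1(x, y)
print(f"selected power {p}, coefficients {beta.round(2)}")
```

A second-degree FP does the same over pairs of powers, which is where the extra flexibility comes from without going fully non-parametric.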
«battling against a simpler but wrong theoretical understanding, to get them through to the mainstream consensus»
http://rwer.wordpress.com/2013/06/30/doctor-x-pure-shit-and-the-royal-societys-motto/
«I found myself sitting next to a very likable young middle-aged academic tenured at an elite British university, whom henceforth I will refer to as Doctor X and whose field is closely associated with this blog. ... “Every year I publish papers in the top journals and they’re pure shit.” Doctor X, who by now had had a glass or two, felt bad about this, not least because “students these days are so idealistic and eager to learn; they’re really wonderful.” Furthermore Doctor X could and would like “to write serious papers but what would be the point?” ... The amount of funding Doctor X’s department receives depends not on how many papers or their quality its members publish, but instead on in which journals they are published. The journals in Doctor X’s field in which publication results in substantial funding will not publish “serious papers” but instead only “pure shit” papers, meaning ones that merely elaborate old theories that nearly everyone knows are false. Moreover, even to publish a “serious paper” in addition to the “pure shit” ones could taint the department’s reputation, resulting in a reduction of its funding. In any case, no one at a top university would read a “serious paper” because they only read “top journals.”»
Maybe I need to read the original post, but this aligns with my traditional understanding of overfitting in statistics: your model is too flexible, or doesn't have enough penalisation, meaning that it fits the noise rather than the actual process. Therefore, it generalises poorly to new data (even data from the same distribution).
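To put that in miniature: fit the same over-flexible model with and without a penalty and compare held-out error. A minimal sketch (the degree, penalty strength and data below are arbitrary choices, purely for illustration):

```python
# Toy comparison: high-degree polynomial regression with and without a
# ridge penalty, on noisy data drawn from a smooth function.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=80).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.4, size=80)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=1)

for name, reg in [("unpenalised", LinearRegression()),
                  ("ridge      ", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), reg)
    model.fit(x_tr, y_tr)
    print(name,
          "train MSE:", round(mean_squared_error(y_tr, model.predict(x_tr)), 3),
          " test MSE:", round(mean_squared_error(y_te, model.predict(x_te)), 3))
```

The unpenalised fit typically shows the lower training error and the higher test error of the two, which is the classic symptom.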
When I used to look at the results coming out of big simulation models (CGE, trade models and similar) one thing I observed was that, almost invariably, 90 per cent of the action could be explained by a couple of key relationships and the way the model was closed.
Digression, but would be really interested in a post listing some of the examples you're referring to in the footnote.
Great post as always, Dan. I particularly like the tension you articulate between inference and prediction.
Inference: If you are trying to test and confirm theories, you can "overfit" to your prior by only looking at the confirmatory data sets.
Prediction: In machine learning, we don't care about theories. Machine learning is the wholly atheoretical prediction of the future from examples. In this case, any theory gets subliminally laundered into the data.
It's a weird field! But it can be remarkably powerful to detach yourself from causal theories. I can't tell you how to write a C program to determine if a jpeg contains a cat. But if I collect a million cat images, I can build a machine learning model that will do a great job.
It's a good (slightly contrarian) take, but I think, as others have said, that there is a useful failure mode hiding under that semantic bloat. You are interested (e.g. with some sort of linear/general linear model) in recovering the parameters and predictor variables of the data generating process. You can add so many variables to the regression that the fit to your training data improves while the predictive accuracy against your test data falls, which implies that some of your predictors are not helping you capture the true data generating process. I would agree that overfitting has been unhelpfully conflated with model misspecification, though.
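To make that concrete, a quick synthetic sketch: the true data generating process below uses only three predictors, and each extra block of pure-noise regressors pushes in-sample R² up while out-of-sample R² drifts down (all numbers are arbitrary toy choices):

```python
# Toy OLS example: adding pure-noise regressors improves in-sample fit
# while out-of-sample fit deteriorates.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, n_real, n_junk = 200, 3, 80
X_real = rng.normal(size=(n, n_real))
X_junk = rng.normal(size=(n, n_junk))  # unrelated to y by construction
y = X_real @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n)

X = np.hstack([X_real, X_junk])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for k in [3, 20, 50, 83]:  # number of predictors included in the regression
    ols = LinearRegression().fit(X_tr[:, :k], y_tr)
    print(f"{k:>2} predictors: "
          f"train R^2 = {ols.score(X_tr[:, :k], y_tr):.3f}, "
          f"test R^2 = {ols.score(X_te[:, :k], y_te):.3f}")
```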
#3 is the most common situation in my experience with academic work. After all, we have to show some results for our efforts, and saying "my theory is wrong" without proposing an alternative is typically frowned on. Instead of admitting, "I don't have an alternative" we'll go with "here's what the data show."
I can go along with overfitting just being “wrong model”, but I still kind of like it because it points to the danger that a nice mathematically generated curve can fit the data but have little explanatory power.
This post reminds me of P.A.M. Dirac's line: "it is more important to have beauty in one's equations than to have them fit experiment."
Overfitting is a euphemism for prejudice (or in today’s parlance: ‘priors’). Having said that, having priors is not a sin; not adjusting them when they fail is.