A slightly longer Friday thing which grew out of a shorter one I wrote mainly to point in the direction of Ben Recht’s great post this week, in which he argues that the concept of “overfitting” is basically incoherent: when you have a model which performs well on its training data but then falls apart when exposed to the real world, that’s just called “being wrong”.
I had not thought about this at all, and had pretty much the same lazy assumptions about “overfitting” as almost anyone else, but reading the post convinced me utterly, and made me re-assess a lot of occasions on which I and others earnestly discussed “overfitting” as a potential cause of whatever problems we were having. Here’s a short list of the sort of thing I’m talking about:
1. The data has one massive feature to it (a spike, or a sudden step change or something), so any model that is capable of generating that feature will fit reasonably well.
2. The data has two or three massive features to it, so every model will adjust its parameters to fit those and nothing else.
3. The data has no real features to it at all, which also means that more or less any sufficiently flexible model will fit the noise as well as any other, and the more flexible the model, the better it will fit the noise (there’s a short sketch of this one just after the list).
4. You’ve just been unlucky, and happen to have a dataset which a wrong model fits better than the correct one does.
5. You did nothing wrong, but the world has actually changed. (Ben notes that one reason for this might be that you forgot about Goodhart’s Law, and that your very modelling was part of a wider project which caused the underlying system to change).
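To make point 3 concrete, here is a minimal sketch in Python, assuming nothing beyond numpy; the noise-only data and the polynomial degrees are all invented for illustration.

```python
# A minimal sketch of point 3: pure noise, increasingly flexible polynomial
# models. All numbers here are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 30

x_train = np.linspace(-1, 1, n)
y_train = rng.normal(size=n)   # no signal at all, just noise
x_test = np.linspace(-1, 1, n)
y_test = rng.normal(size=n)    # a fresh draw from the same non-process

for degree in (1, 5, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")

# Training error falls as the model gets more flexible; test error does not,
# because there was never any structure there to learn in the first place.
```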
The big problem in all of these is nothing to do with the fitting – the problem is that the model’s wrong! The mistake wasn’t in the way that you estimated it.
Oddly, this makes me think I can save the term “overfitting” for use in a very limited variety of contexts. In machine learning contexts, where you’re not making any structural or theoretical assumptions about the underlying data generating process, I think this critique is unstoppable – overfitting is just a bad concept, don’t use it.
But in contexts (like economics, which I’m most familiar with) where the model you’re fitting has quite a lot of embedded structural assumptions, I think there is a place for saying that “being overly convinced by a superior fit to the data of a model with unattractive theoretical properties” might be a methodological sin. “It works in practice, but does it work in theory?” is often a very good question to ask.
And the reason I find myself wanting to defend “overfitting” is because I’m thinking of an exception that proves the rule. If we think of the work of Card and Krueger on the minimum wage, then one of the things that economists got exactly wrong was dismissing a study that fit the data very well, but which had unattractive theoretical properties. And the way that the profession finally convinced itself[1] that one of the alternative models of wages and employment had to be right was precisely the replication of Card & Krueger’s New Jersey study in other situations.
So when you have something that doesn’t work in theory but want to know if it works in practice (perhaps to the extent that the theory needs revising), then a big part of that debate is establishing whether it’s dependent on a single dataset. I still don’t (now that I’ve been prompted to think about it) think that “overfitting” is a very good word to describe that kind of debate, but it’s what we have and so I think I’ll allow it.
[1] It troubles me more and more that there are quite a lot of results in economics which are fairly well understood and appreciated by small groups of specialist cognoscenti, but where nobody has yet stepped up to do Card & Krueger’s job of battling against a simpler but wrong theoretical understanding, to get them through to the mainstream consensus.
I agree that most of the time the problem is that your model is bad and you should feel bad, but I think 'overfitting' is still a useful concept to have in your mind as an explanation for what just happened when your kitchen-sink ML model fits absolutely beautifully in training and just doesn't on the validation/test data. The moral high ground says not to train models by throwing the kitchen sink into the neural net blender with no thought about what features you think ought to fit, but if you went with that you'd have missed most of the huge gains in AI over the last decade or two. We are still struggling to make any sense whatsoever of what features LLMs have ended up learning about language (and maybe indirectly the world?!) but they sure can generate very plausible text.
It's a good (slightly contrarian) take, but I think, as others have said, that there is a useful failure mode hiding under that semantic bloat. You are interested (e.g. with some sort of linear/general linear model) in recovering the parameters and predictor variables of the data generating process. You can add so many variables to the regression that the fit to your training data improves while the predictive accuracy against your test data gets worse (sketched in code below), which implies that some of your predictors are not helping you capture the true data generating process. I would agree that overfitting has been unhelpfully conflated with model misspecification, though.
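For what it's worth, here is a hedged sketch of that regression point, again using only numpy; the setup (one genuinely relevant predictor, thirty junk columns, a true coefficient of 2.0) is entirely made up for illustration.

```python
# A sketch of the comment's point: one real predictor plus a pile of junk ones.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, n_junk=30):
    """One relevant predictor, n_junk irrelevant ones, plus an intercept column."""
    x_real = rng.normal(size=(n, 1))             # the predictor that matters
    junk = rng.normal(size=(n, n_junk))          # predictors unrelated to y
    y = 2.0 * x_real[:, 0] + rng.normal(size=n)  # true data generating process
    X = np.hstack([np.ones((n, 1)), x_real, junk])
    return X, y

X_train, y_train = make_data(40)
X_test, y_test = make_data(400)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

for k in (2, 12, 32):  # intercept + real predictor, then 10 junk, then all 30
    beta, *_ = np.linalg.lstsq(X_train[:, :k], y_train, rcond=None)
    print(f"{k - 2:2d} junk predictors: "
          f"train R^2 {r_squared(y_train, X_train[:, :k] @ beta):.2f}, "
          f"test R^2 {r_squared(y_test, X_test[:, :k] @ beta):.2f}")

# In-sample R^2 can only rise as columns are added; out-of-sample R^2 drifts
# down, because the extra coefficients are fitting noise rather than anything
# in the true data generating process.
```

The nested column sets are just there to make the monotone in-sample improvement obvious; a cross-validated variable selection exercise would tell the same story.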