trying to do something you probably shouldn't
goodhart's law, slight return
This post from Jenn Pahlka (detailing a debate with Dave Guarino, who is also very good) reminded me that it’s been a while since I talked about Goodhart’s Law. As Jenn says, it’s awfully tempting to say that “we just need to get the metrics right”, but the history of attempts to do this is pretty disappointing. Does that mean that it’s a chimera, or are there some ways of creating good measurement dashboards, which either remain resilient for a useful amount of time, or break down in easily diagnosable ways and tell you that you need to revise the metrics?
In the early days of this ‘stack, I pointed out that “when a measure becomes a target it ceases to be a good measure”, while catchy, can’t be right. Goodhart was one of the international advisors on the drafting of New Zealand’s Reserve Bank Act 1989, the birth of modern inflation targeting; he obviously can’t be recruited for a project of opposition to targets in general, even to targets based on quite controversial and inaccurate measures (like CPI inflation).
The “ceases to be a good measure” formulation was attributable to the anthropologist Marilyn Strathern. Goodhart’s original statement was “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes”. As I bored on a while ago, I very much prefer Goodhart’s original version, which makes it clear that we are talking about statistical regularities, and about their use for control. If you are running a vaccination program and targeting a reduction in the number of smallpox cases, this doesn’t cease to be a good measure of the number of smallpox cases.
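A minimal sketch of what Goodhart's original version is describing, under invented assumptions: a proxy tracks an underlying quality closely while it is only being observed, and much less closely once the people being measured have a reason to push it around. The names `quality`, `proxy_observed`, `proxy_targeted` and `gaming`, and all the numbers, are hypothetical and chosen purely for illustration; nothing here is estimated from real data.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=1000, seed=0):
    rng = random.Random(seed)
    # The thing we actually care about (assumed unobservable in practice).
    quality = [rng.gauss(0, 1) for _ in range(n)]
    # While merely observed, the proxy is quality plus a little measurement noise.
    proxy_observed = [q + rng.gauss(0, 0.3) for q in quality]
    # Once targeted, each unit also inflates the proxy by an amount
    # that is unrelated to its quality (the illustrative assumption here).
    gaming = [rng.uniform(0, 10) for _ in range(n)]
    proxy_targeted = [p + g for p, g in zip(proxy_observed, gaming)]
    return pearson(quality, proxy_observed), pearson(quality, proxy_targeted)


before, after = simulate()
print(f"correlation while only observed: {before:.2f}")
print(f"correlation once targeted:       {after:.2f}")
```

The proxy is still an accurate count of whatever it counts; what weakens is the regularity linking it to the thing you wanted, which is the distinction the original wording preserves.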
So what’s going on here, and can I do the “John Henry was a steel drivin’ man” competition to come up with a better analysis for Jenn than Claude did?
My first thought here was that there is an old backgammon proverb to the effect that if you’re in a position of uncertainty between move A and move B, this usually means that you should instead be looking for C which is better than both. (The idea is that most backgammon positions have a single definitely superior move to play, and then a host of more or less equivalent mediocre alternatives. I’m sure this could be analysed statistically but I haven’t seen any such proof, just the proverb.)
In the context of measures and management, I would say that if you are trying to assess whether a metric is valid or whether it’s subject to Goodhart’s Law, this is likely to be evidence that you need more structural change in how you’re managing the thing.
This all relates to which kinds of metrics can and can’t be gamed; as I discussed above and in the old post, it’s important to see that Goodhart’s Law is really a statement about intermediate and second-order measures. An output is an output – if you are measuring the thing that you care about, like smallpox cases or CPI inflation, then you might have problems with respect to whether that’s an accurate measurement, but this isn’t the same thing as Goodhart’s Law. In education research, I’ve often made the case that the phenomenon of “teaching to the test”, in a pejorative sense, ought to be called “not testing the right thing”.
Similarly, an input is an input. Like a direct measurement of output, a direct measurement of input is something you care about; it just enters into the calculation with the opposite sign. Again, some inputs are difficult to measure, because they include things like “cognitive demand on the manager” or “public trust and goodwill”, but the problems here are not problems of Goodhart’s Law. Inputs and outputs are measurements, but they aren’t “statistical regularities”; they’re not proxies for anything and they can’t be gamed.
And so, the best way to deal with Goodhart’s Law is to go back to Stafford Beer – “there is nothing to be gained from opening the black box”. If you set a resource bargain, then you do this by specifying the output you want, and the amount of inputs that are available. Neither of these is the kind of measure that is subject to Goodhart’s Law. The nature of the resource bargain is that the unit that you have delegated the task to has discretion about how it generates outputs, subject to not using more inputs than the wider system can spare.
In other words, worrying about Goodhart’s Law is, at some level, a sign that you are trying to do something which you perhaps shouldn’t. You’re opening up the black box and micromanaging; in a lot of cases, the kind of subversive behaviour that makes the measures no longer useful is an attempt by the subsystem to route around the damage that you’re causing.
I think this argument works, in a purist kind of way. But how many of us are quite so lucky as to be able to deal with our problems by reorganising and delegating to a completely trustworthy subsystem? Not one in ten thousand. (Even the deadweight cost of all the restructurings is something that Beer occasionally seems a bit handwavey about for my taste.)
If you’re in the position of needing to use statistical regularities and indirect measures, what do you do? I think a good initial step is to understand in your own mind (although perhaps to find a tactful way of saying it in meetings) that the basic problem here is that you don’t trust the people you’re managing, and that the real goal has to be to build good enough trust that you can return to the black-box system of a resource bargain. But with that in mind, I actually think Jenn’s Claude summary is quite good; redundant measures, regularly retired and updated and used as the trigger for deep-dive investigation rather than “high stakes” action are what you need.
These are all, of course, what Stafford Beer would describe as “variety engineering”. The problem of Goodhart’s Law is only one way in which you can go wrong if you are trying to control a complex and high-variety system using a narrow bandwidth information channel. Which is to say once more – it’s not really a law, more an alert to the fact that you’re trying to do something you probably shouldn’t.

Two thoughts:
1. The statistical regularity between "good economy" and "two percent inflation" does perhaps seem to be collapsing due to the control pressure placed on it.
2. Sometimes the problems of measurement and trust really are unfixable. Teaching to the test is indeed a problem of testing something other than what you want people to accomplish, but it's also genuinely impossible to "test" the things we want education to accomplish, especially at the scale of our modern societies.
I like the nuance in the shift from 'measure becomes a target' to 'collapse of statistical regularities'.
Almost all the things you can measure you can only measure indirectly. To take your example, you can't measure the number of smallpox cases, you can only measure the number of *detected* smallpox cases, or the number of *reported* cases. I could reduce the number of detected smallpox cases trivially by reducing testing. I could reduce the number of reported cases by all sorts of statistical and reporting aggregation shenanigans, or straight up by making it clear to the reporters that Very Bad Things will happen if they report any.
> redundant measures, regularly retired and updated and used as the trigger for deep-dive investigation rather than “high stakes” action are what you need
This helps. In small enough contexts, you can get a long way with indirect measures and trust, making it very clear that attempts to game the metrics will be regarded as a very significant breach of trust, much worse than underperformance.
In bigger contexts interpersonal trust is harder - although how big is variable and I have seen what I think you call the Canada Paradox inside organisations (a high trust environment will, paradoxically, see more egregious fraud).
Anyway, where you don't have as much of that, you can lean towards measures that are harder to influence by routes other than the intended one. The central bank is accountable for CPI but there are obvious problems if you make the central bank responsible for compiling inflation statistics. (Anyone who has opened the box of horrors that is CPI methodology will see that it is very much an indirect and imperfect measure of anything, let alone 'inflation'.)
I would say "we have to get the metrics right", but there is no "just" about it: it's a deeply complex, context-specific exercise that very much is not a once-and-done thing (and sometimes less is more).