22 Comments
Sam Tobin-Hochstadt

Two thoughts:

1. The statistical regularity between "good economy" and "two percent inflation" does seem to be collapsing, perhaps due to the control pressure placed on it.

2. Sometimes the problems of measurement and trust really are unfixable. Teaching to the test is indeed a problem of testing something other than what you want people to accomplish, but it's also genuinely impossible to "test" the things we want education to accomplish, especially at the scale of our modern societies.

Indy Neogy

I think (2) is really important - we've lost track of the fact that centralised control costs resources. The IT age has mesmerised people into thinking that you can in fact micromanage everything. However, all sorts of things are not amenable to cheap metrics.

Sam Tobin-Hochstadt

This is right, but a different problem I see often is the belief that we could just do away with the challenges of having big organizations by having small ones instead. My kids go to a school system with ~8000 students, and some metrics are needed for management there. But the NYC schools have 100 times as many students and obviously cannot be managed without the tools of modern organizational life.

Indy Neogy

Oh yes, I sometimes forget I've yet to write my big polemic against the "small is beautiful" school of complexity response. Agree - my point is that we may have overreached on centralisation. New tech means you don't have to devolve power/control/etc as much as in 1970 - but maybe you still need to do it more than we often think.

Sam Tobin-Hochstadt

To align this with a different comment thread, I think what's going on here is that in trying to figure out how to manage big organizations, people have said "instead of needing high quality and trustworthy managers at every level, we can just produce good metrics that can be assessed at the top". And the "small is beautiful" people are basically saying "instead of needing accurate and reliable metrics, we can just have good leaders at every small institution".

The solution is realizing that there is no "just".

Matt Woodward

Thesis: the larger the organisation, the more resistant it is to central control. Humans are not good organisational building blocks once you scale past ~Dunbar’s number, and just as with building a structure out of beer cans, there’s only so big an organisation you can build with people before the only “effective” approach is an untidy pile that’s slowly collapsing under its own weight.

Sam Tobin-Hochstadt

I don't think this is true; there are certainly examples of large human organizations that work -- the US Army is maybe the paradigmatic one.

Philip Koop

The inflation targeting example had also occurred to me but in a different way. In my opinion, inflation targeting hasn't eliminated variations in the economy, but it has reduced their magnitude. I interpret this as follows: when there is a causal linkage between an intermediate or secondary measure and a final output, but that linkage is only partial, then applying cybernetic control of the measure will, by hypothesis, leave you with the residual variations in output that are not linked to the measure.
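A toy simulation can make this point concrete (my own sketch, not from the comment; the additive model and all the numbers are assumptions): if the final output is the sum of a proxy measure and variation unlinked to that proxy, then a controller that pins the proxy at its target removes only the proxy-linked variance, leaving the residual variance untouched.

```python
import random

random.seed(0)

def output(proxy, residual):
    # Final output is only partially linked to the proxy measure;
    # the residual term is the variation the proxy cannot "see".
    return proxy + residual

N = 100_000

# Uncontrolled: the proxy measure varies freely (variance 4),
# on top of unlinked residual variation (variance 1).
uncontrolled = [output(random.gauss(0, 2), random.gauss(0, 1)) for _ in range(N)]

# Controlled: cybernetic control pins the proxy at its target (here, 0).
# It can do nothing about the variation not linked to the proxy.
controlled = [output(0.0, random.gauss(0, 1)) for _ in range(N)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance(uncontrolled))  # ≈ 5: proxy variance (4) + residual variance (1)
print(variance(controlled))    # ≈ 1: only the unlinked, residual variance remains
```

Control of the measure reduces output variance a lot, as with inflation targeting, but by hypothesis it can never drive it below the part that was never linked to the measure in the first place.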

You can think of this as a kind of selection bias. A similar example would be selecting on grades to predict future scholastic performance. This might work well when selecting on secondary school performance to predict undergraduate performance, but apparently undergraduate grades are not a good predictor of the academic prospects of grad students. But of course, the grad students all had very good grades. We can't "see" the academic performance of undergraduates with mediocre grades because they were selected out by the control mechanism.

Matt Woodward

I'd go further and say that *usually* the problems of measurement and trust are unfixable, at least if we're talking about large orgs, because the target is trying to externally coerce a change in behaviour in a viable system, and that system's natural response will always be to minimise both the degree to which it suffers from the coercion and the degree to which it needs to actually change its behaviour.

The strong version of Goodhart's law matches my anecdotal experience well: formal targets ~always indicate (and often directly create!) an adversarial relationship, so the targetee will ~always seek to game them, and thereby break the correlation between measurement and desired outcome that the target rests on.

Doug Clow

I like the nuance in the shift from 'measure becomes a target' to 'collapse of statistical regularities'.

Almost all the things you can measure you can only measure indirectly. To take your example, you can't measure the number of smallpox cases, you can only measure the number of *detected* smallpox cases, or the number of *reported* cases. I could reduce the number of detected smallpox cases trivially by reducing testing. I could reduce the number of reported cases by all sorts of statistical and reporting aggregation shenanigans, or straight up by making it clear to the reporters that Very Bad Things will happen if they report any.

> redundant measures, regularly retired and updated and used as the trigger for deep-dive investigation rather than “high stakes” action are what you need

This helps. In small enough contexts, you can get a long way with indirect measures and trust, making it very clear that attempts to game the metrics will be regarded as a very significant breach of trust, much worse than underperformance.

In bigger contexts interpersonal trust is harder - although how big is variable and I have seen what I think you call the Canada Paradox inside organisations (a high trust environment will, paradoxically, see more egregious fraud).

Anyway, where you don't have as much of that, you can lean towards measures that are harder to influence by routes other than the intended one. The central bank is accountable for CPI but there are obvious problems if you make the central bank responsible for compiling inflation statistics. (Anyone who has opened the box of horrors that is CPI methodology will see that it is very much an indirect and imperfect measure of anything, let alone 'inflation'.)

I would say "we have to get the metrics right", but there is no "just" about it: it's a deeply complex, context-specific exercise that very much is not a once-and-done thing (and sometimes less is more).

Sam Tobin-Hochstadt

Many of the problems we face are caused by people searching for a way to put the "just" back in sentences like that.

Blissex

«I like the nuance in the shift from 'measure becomes a target' to 'collapse of statistical regularities'. Almost all the things you can measure you can only measure indirectly. To take your example, you can't measure the number of smallpox cases, you can only measure the number of *detected* smallpox cases, or the number of *reported* cases.»

Great points.

«I could reduce the number of detected smallpox cases trivially by reducing testing. I could reduce the number of reported cases by all sorts of statistical and reporting aggregation shenanigans, or straight up by making it clear to the reporters that Very Bad Things will happen if they report any.»

For me the critical detail here is the wondering about the *motivation* for doing this. In my simplistic guess it is usually because of incentives or penalties around making or losing money, that is, conflicts of interest.

«This helps. In small enough contexts, you can get a long way with indirect measures and trust, making it very clear that attempts to game the metrics will be regarded as a very significant breach of trust»

In my simplistic mindset this is an argument not about "trust" but for countering potential incentives to twist the metrics with potential penalties for getting caught in order to reduce the conflict of interest between "subsystem" (agent) and "system" (principal). IIRC this was known in antiquity as the "sharecropper problem".

The Backseat Policy Critic

One of the most interesting things I’ve found from studying the ‘great British generals’ (Marlborough, Wellington and Montgomery) is that they all tended to organise their headquarters in similar ways that were nonetheless fairly distinctive amongst their peers. Particularly relevant to this post, though, was their extensive use of ‘liaison officers’.

Montgomery for example would go round his various units and effectively recruit a bunch of intelligent captains in their early twenties and attach them to his own HQ as his liaison officers. Each one would be assigned a particular unit, and their job effectively was to head over to the relevant unit command each day and basically just have a chat with the commanding officer and his team in order to get a full on-the-ground understanding of what exactly was going on, before reporting back to Monty in the afternoon to explain the gist.

This effectively cut out all the limitations of trying to use quantitative measures, allowed the full communication of the nuances of things, prevented any ‘massaging of reporting’ (it’s extremely difficult to bluff your way through when they can quite literally see the bodies being brought back in), and meant Monty had pretty much as close as you could get to a real-time understanding of exactly what was going on everywhere on the battlefield and in the organisation, at a time when everyone else was stumbling around with written reports that were already out of date before they were even typed (both Wellington and Marlborough had a similar system).

I’m distinctly reminded of your posts on flat cap corporatism and the paper you did on Anglophone vs European legal review systems, and with examples like this in mind I’m increasingly wondering if the solution to a lot of these issues is simply to get people back in a room together and talking, rather than relying on endless oppositional metrics and mutual distrust. I’m pretty sure the Viable System Model even bakes in something along these lines as part of System 3, in the form of reporting functions that bypass the immediate layer of management below.

(On a side note, for anyone interested Montgomery’s command style and running of his armies is genuinely fascinating - he pretty much invented and implemented his own version of the Viable System Model about twenty years before Stafford Beer came up with it: https://etheses.whiterose.ac.uk/id/eprint/1753/1/C.J.Forrester_PhD_History_Montgomery_and_his_Legions.pdf)

Ewout ter Haar

In education, "teaching to the test" is usually interpreted as "narrowing the curriculum", and if you say, well, just expand the test to measure all of the curriculum, you quickly get into Borges's "On Exactitude in Science" territory.

In other words, the opposite of teaching to the test is not testing better, but recognizing that testing is not teaching (lots of well-intentioned thinking about formative assessments notwithstanding).

Blissex

«In education, "teaching to the test" is usually interpreted as "narrowing the curriculum" [...] the opposite of teaching to the test is not testing better, but recognizing that testing is not teaching»

Or one could recognize that most people who are on the path to become employees want education because it will make them money, not because it teaches them anything (such philistines), and that most people who are employers want employees who pass tests rather than learn from teaching, because they want people who are diligent more than educated (more philistines).

Because things are rather different if one is teaching to people of independent means who simply want to learn for their own sake because then tests are feedback on learning instead of filters for the most diligent.

Sam Tobin-Hochstadt

I think you're right to characterize it as narrowing, but the key question is what the response to failure on the test should be. That is, if we test reading and math and aren't happy with the results, should we respond by narrowing things so that more time is spent on reading and math? Because the test is only on reading and math, only that narrowing can improve performance.

Andy Berner

On the topic of conflicting incentives in measurement, system control, and how money can skew things, this is wild: https://bsky.app/profile/followtheh.bsky.social/post/3mk5zvd7o7z23

The issues interpreting measurement error and biases in weather station data have long been a contentious topic in climate science, but intentionally biasing station data to win a Polymarket bet is a new one...

Kenny Fraser

This is great and I love the thinking. The challenge at the level of a business is that "hard to measure" also covers "susceptible to accounting manipulation as a form of gaming". Even basic metrics like revenue and profit are indirect, so vulnerable to being gamed, at least for a while.

Blissex

I am happy for him that our author lives in a beautiful world in which organizations process information to make good or bad decisions, use metrics as part of that information, and then see the metrics gamed by subsystems protecting themselves from micromanagement:

«worrying about Goodhart’s Law is, at some level, a sign that you are trying to do something which you perhaps shouldn’t. You’re opening up the black box and micromanaging; in a lot of cases, the kind of subversive behaviour that makes the measures no longer useful is an attempt by the subsystem to route around the damage that you’re causing.»

In the more vulgar world I live in, organizations are created and funded to make money (or equivalently to avoid losing money), process incentives and penalties about money, and make profits or losses; and metrics get faked because of conflicts of interest between the owner of a system and the owner of the subsystem, as they have different incentives and penalties to optimize. Also in this vulgar world, unfortunately, this kind of clarity is rather rare:

«Inputs and outputs are measurements, but they aren’t “statistical regularities”, they’re not proxies for anything and they can’t be gamed.»

Because in my vulgar world most non-money input and output measurements are indeed proxies for money to be made or lost whether they are for “smallpox cases or CPI inflation” or something else (and given common incentives and penalty structures even money measurements like profit and loss are routinely faked).

Our author seemed to live in my world when he wrote his book "Lying for Money: How Legendary Frauds Reveal the Workings of Our World", but then I am happy for him that he moved to a better world where conflicts of interest and diverging incentives and penalties are not the cause of twisting metrics but are often the result of insufficient attention to a technicality such as “variety engineering”:

«These are all, of course, what Stafford Beer would describe as “variety engineering”. The problem of Goodhart’s Law is only one way in which you can go wrong if you are trying to control a complex and high-variety system using a narrow bandwidth information channel.»

I so wish I could move like our author to his world, where organizations and the people operating in them do not worry so much about making or losing money and incentives and penalties, but about mechanisms to process information to make good decisions thanks to robust variety engineering.

:-)

TW

In US-tech-startup land, founders are often more successful with a second or later venture, even if the first was a meteor-crater failure, because first-timers suffer a twofold curse. One, they don't know what they don't know, and two, what they think they know often ain't so in some major way. The reason is structural; to simplify a great deal, tech markets are unusually likely to be winner-take-all, in the US anyways, and so breaking in means you need to get a trophy in a race none of the current trophy-holders entered. The new race, say the 101-meter sprint, is likely to have some game-changing difference from superficially similar races like the 100. By definition, you have very little clue how to run it.

So the rational course is to focus obsessively on the relationship between inputs and outcomes: "our sales guys need to send fifty cold outreaches per week" is a reasonable idea for a post-scale business, but for a startup it looks more like "three cold outreach emails a week, targeting senior network architects who changed jobs in the past six months to a smaller durable-metals supplier." (And that's a whiteboard first pass: it'll need to get even more granular.) And it turns out that the black box may only be greyscale, at this stage anyhow.
