the misaligned organisation
continuing to worry at a fascinating bone
This week’s order is reversed while I catch up - a bit of a philosophical joke post today, and something more substantial on the economics of AI on Friday. This follows up an issue I discussed on social media with a couple of friends yesterday, but which I think has wider interest …
Now the New York Times has caught up with the idea of “emergent misalignment” that we were talking about a few weeks ago. (Capsule summary – if you take a general purpose LLM, and then specifically train it on examples of badly written or insecure computer code, it doesn’t just learn bad programming habits. It also starts to give bad medical advice, to give bad responses to ethical questions and to admire Hitler).
I think the op-ed does quite a good job of taking this seriously as a phenomenon; that there is seemingly some kind of “shape” to the vector space of tokens, and that the unimaginably vast dataset of content scraped from the Web has a sort of principal component that can be interpreted as “good/bad”. I am not sure about all the virtue ethics stuff (as Chris points out, the whole point of virtue ethics is that morality can’t be reduced to an algorithm, and as Brian says, “I have days when I make lots of coding errors, but I don’t think I feel more Nazi on those days”).
But this shape to the data is not in any way meaningless – as I said in the last post, although I think everyone had kind of guessed that the “anti-woke” vector points in the direction of “Nazi” rather than the direction of “free speech absolutist”, it’s quite interesting to know that this is literally mathematically true. And although I’m not personally convinced by the idea I raised in my last post, that if you discovered that one of your views tends to cluster with the “bad” group you should reconsider it, I think it’s a serious challenge; maybe you’re just a unique and heterodox thinker, but maybe it’s just a prejudice, and the thing about that distinction is that you’re probably not well placed to make the judgement call.
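To make the “shape” idea concrete, here’s a minimal sketch, with entirely made-up synthetic embeddings rather than any real LLM’s token space, of how a dominant “good/bad” axis in the data would fall out as the first principal component:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 text embeddings in a 50-dim space, constructed so
# that one direction (a "good/bad" axis) carries most of the variance, the
# way the op-ed suggests the web corpus does.
good_bad_axis = rng.normal(size=50)
good_bad_axis /= np.linalg.norm(good_bad_axis)

labels = rng.choice([-1.0, 1.0], size=200)       # -1 = "bad", +1 = "good"
embeddings = (
    np.outer(labels * 3.0, good_bad_axis)        # strong good/bad signal
    + rng.normal(scale=0.5, size=(200, 50))      # everything else is noise
)

# First principal component via SVD of the centred data.
centred = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
pc1 = vt[0]

# If the "shape" is real, the top component recovers the good/bad axis
# (up to sign), even though nothing was labelled for the SVD.
alignment = abs(pc1 @ good_bad_axis)
print(f"|cos(PC1, good/bad axis)| = {alignment:.3f}")
```

The point of the toy is only that an unsupervised decomposition can recover a value-laden axis if that axis happens to dominate the variance; whether the real training corpus has that property is the empirical question.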
What’s on my mind after reading it again, though, is that this is an empirical fact about non-human data processing systems implemented as neural networks. Is it a fact about the neural network algorithm specifically, or is it something that is generally true of things which make decisions by processing data? Specifically, since I’ve argued in print that organisations, corporations and governments can be seen as “artificial intelligences”, in the sense that they’re non-human decision-making systems, do they have this property of emergent misalignment?
I think you could make a reasonable empirical argument that they do. The things which make organisations dumber and worse at operational and technical functions (lack of resources, poor internal communication, low morale) also do seem to make them more callous and unethical. The example at the top of my mind is the Home Office, but I’m sure there are others. (JK Galbraith once told a colleague who had been offered a job at the State Department that “you will find that State is the kind of organisation which, although it does small things badly, does big things badly too”.)
And although it very much feels like an excuse (“lack of resources” is a terrible accountability sink), I think it could even be argued that there is a causal link between organisations being bad at doing things, and being bad in the sense of doing bad things. One way to describe the kind of thing I talk about a lot in “The Unaccountability Machine” is that breaking links of accountability, and creating policies which have inhumane effects when applied to real-world cases, are strategies by which overloaded administrators and systems try to manage their stress. The cognitive dissonance caused by being a good person in a bad system is immense, and one way to reduce it is to stop being such a good person. As someone at Fox News said in the aftermath of January 6th 2021, “bad ratings make good journalists do bad things”.
Writing this down, I think it’s unconvincing to say that there is some general law of virtue, connecting competence and morality in the way that the New York Times author seems to be hinting at. I’ve sketched out a causal mechanism whereby the two might be linked in organisations, but it’s not one which could work for the LLM case; the neural network isn’t under any more or less stress when it’s trained to write bad code.
So it might just be an empirical coincidence. Unless, I suppose, the corpus of training data was produced under such conditions as to import the relationship between information overload, unaccountability and general badness into the token space. Which I still think is a bit too speculative; what do you guys think? Anyway, happy Wednesday.

My suspicion remains that the mechanism for this in LLMs is not "bad things cluster together naturally", but rather "certain bad things are clustered together by RLHF".
As in: during pretraining, the model is just trained to fit its output to the training data, without judgment. Then there is a process (Reinforcement Learning from Human Feedback, or RLHF) where the model is put through a further training regime where humans essentially upvote or downvote particular outputs, in order to curate its overall output. This is the stage at which an LLM destined to become a chatbot is made chatty, for example - or sycophantic, in many cases - but it's also where the provider starts the process of weeding out "bad" responses. If the outputs are e.g. morally objectionable, they're downvoted, causing fewer outputs of that type in future.
It seems very plausible to me that such "bad" outputs are all getting essentially tagged as "RLHF says don't do this", and that "write unsafe code" and "advocate for genocide" get clustered together in this process. When you then *further* train that model to start outputting bad code again, the most parsimonious way to adjust the weights to achieve that is just to tone down the influence of "RLHF says don't do this", at which point all the bad things come back up to the surface again.
Crucially for the context of this discussion, this explanation doesn't require that the raw training data or the fundamental training approach of LLMs imply any particular moral perspective or assessment - the moral component is baked in during RLHF by humans instructing the model which things are bad. If this is the correct explanation, then doing RLHF with a different set of guidelines would produce a different clustering of moral assessments that line up with those guidelines, so it wouldn't be in-principle hard to test whether this is the case.
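For what it's worth, the "shared suppression weight" story can be sketched in a toy model. Everything below is hypothetical - sigmoid scores standing in for the real network, and a single parameter `w` standing in for the RLHF-learned "don't do this" feature - but it shows the claimed mechanism: fine-tuning toward one "bad" behaviour drags the shared penalty down and revives the others.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical toy model: three "bad" behaviours share one RLHF suppression
# weight w, so P(behaviour i) = sigmoid(b[i] - w).
behaviours = ["insecure code", "bad medical advice", "admiring Hitler"]
b = np.array([0.5, 0.5, 0.5])  # the model's underlying pull toward each
w = 4.0                        # shared "RLHF says don't do this" weight

before = sigmoid(b - w)        # all three behaviours start suppressed

# Fine-tune ONLY on "output insecure code": gradient ascent on log P(code),
# updating both the behaviour-specific b[0] and the shared w.
lr = 0.5
for _ in range(50):
    p_code = sigmoid(b[0] - w)
    grad = 1.0 - p_code        # d log P(code) / d b[0]; d/dw is its negative
    b[0] += lr * grad
    w -= lr * grad             # the shared weight absorbs part of the update

after = sigmoid(b - w)

for name, p0, p1 in zip(behaviours, before, after):
    print(f"{name:20s} {p0:.3f} -> {p1:.3f}")
```

Nothing in the fine-tuning loop mentions medicine or Hitler; those probabilities rise purely because the suppression parameter is shared, which is the clustering claim in miniature.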
One way to reduce the dissonance is to stop being such a good person. Another is to leave the organization. This, um, "retention bias" can be quite powerful over time.
To the extent that the people who care about doing a job well also care about not doing evil, the organization will lose both qualities over time.