20 Comments
Matt Woodward

My suspicion remains that the mechanism for this in LLMs is not "bad things cluster together naturally", but rather "certain bad things are clustered together by RLHF".

As in: during pretraining, the model is just trained to fit its output to the training data, without judgment. Then there is a process (Reinforcement Learning from Human Feedback) where the model is put through a further training regime where humans essentially upvote or downvote particular outputs, in order to curate its overall output. This is the stage at which an LLM destined to become a chatbot is made chatty, for example - or sycophantic, in many cases - but it's also where the provider starts the process of weeding out "bad" responses. If the outputs are e.g. morally objectionable, they're downvoted, causing fewer outputs of that type in future.

It seems very plausible to me that such "bad" outputs are all getting essentially tagged as "RLHF says don't do this", and that "write unsafe code" and "advocate for genocide" get clustered together in this process. When you then *further* train that model to start outputting bad code again, the most parsimonious way to adjust the weights to achieve that is just to tone down the influence of "RLHF says don't do this", at which point all the bad things come back up to the surface again.

Crucially for the context of this discussion, this explanation doesn't require that the raw training data or the fundamental training approach of LLMs imply any particular moral perspective or assessment - the moral component is baked in during RLHF by humans instructing the model which things are bad. If this is the correct explanation, then doing RLHF with a different set of guidelines would produce a different clustering of moral assessments that line up with those guidelines, so it wouldn't be in-principle hard to test whether this is the case.
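A minimal numerical sketch of this "shared suppression knob" hypothesis (entirely my construction, not a claim about real LLM internals: the gate, the propensity numbers, and the learning rate are all invented for illustration):

```python
import numpy as np

# Toy model: three "bad" behaviours share a single suppression gate g,
# which RLHF has pushed high. All numbers here are invented.
base = np.array([0.90, 0.85, 0.95])   # raw propensities of the 3 behaviours

def expressed(g):
    """Expressed level of each behaviour under the shared suppression gate."""
    return base * np.exp(-g)          # one gate damps all of them uniformly

g = 5.0                               # "RLHF says don't do this"
before = expressed(g)

# Fine-tune with a reward only for behaviour 0 ("write insecure code").
# Reward R(g) = expressed(g)[0], so dR/dg = -expressed(g)[0]; gradient
# ascent on R therefore does nothing but lower the shared gate.
for _ in range(140):
    g += 1.0 * (-expressed(g)[0])

after = expressed(g)
print(np.all(after > before))  # True: all three behaviours rise together
```

The parsimony point is visible here: a single scalar update undoes the suppression of everything behind the gate, with nothing behaviour-specific needing to be relearned.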

Dan Davies

I like this idea and now very much want to know whether training an American LLM on insecure code would make it more or less likely to agree that Tibet is historically part of China.

Matt Woodward

Yeah, there are certain things that are both deeply-held norms/taboos and also *largely* historically contingent, which in principle ought to make it possible to differentiate between "LLMs are emergently developing a consistent moral compass from the training corpus" and "LLMs are encoding commonly-accepted (American) moral stances from RLHF". Alcohol vs other drugs is one vector, "meats that it is acceptable to eat" is another, and as you say certain deep-seated political conventions are a third.

Human moral assumptions tend not to be all that consistent, so if an LLM can demonstrate a fully-consistent moral framework, that's actually probably a strong sign that it's an emergent property of the training process rather than something imposed by humans.

TW

The fundamental problem here is that you can't control the operations of a black box, even by using more black boxes to "refine" the workings of the original black box.

There's broad agreement in our society that we should educate children so that they're not racist. We can't even do this reliably with preschoolers, who have a much smaller corpus and are far more tractable to training than older models.

The key perhaps is some kind of Kaspar Hauser approach (even if that didn't work out too well). OpenPatient, a US company, is trying to provide accurate medical diagnoses by never having allowed the machine to access the Internet, only a limited (but powerful) corpus of things like JAMA journals. It's a compelling concept, assuming the inputs remain more or less "clean" when doctors use it. I understand they've reached 50% of the medical market, a faster adoption than anything but Google.

dribrats

One way to reduce the dissonance is to stop being such a good person. Another is to leave the organization. This, um, "retention bias" can be quite powerful over time.

To the extent that the people who care about doing a job well also care about not doing evil, the organization will lose both qualities over time.

Alexander Harrowell

"I have days when I make lots of coding errors, but I don’t think I feel more Nazi on those days"

The thing about the banality of evil, though, is that it's *banal*.

Daniel Sword

On LLMs, I think the simple explanation is that the bulk of the training corpus included labels indicating "good" or "bad," and the fine-tuning process has simply communicated the user's desire to spend more time exploring the "bad" region of space. It has seen critique and praise of Hitler, labelled "good" and "bad", respectively, and it has seen code with comments like "Essential to use a cryptographically secure random number generator here!" (followed by code which does exactly that), and so on. It's not hard at all to get a network to learn an inverter function f(x in {0,1}) -> 1-x. (One way to test: invert all the labels on the training corpus text about Hitler before pre-training, then see if the models go from praising to critical after the fine-tuning exercise.)
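For what it's worth, the inverter claim is easy to check at toy scale. A sketch (my own, with an assumed single linear unit and plain gradient descent; nothing here comes from the comment beyond the mapping f(x) = 1 - x):

```python
import numpy as np

# Fit a single linear unit y = w*x + b to the inverter f(x) = 1 - x on
# inputs {0, 1}, by gradient descent on mean squared error.
X = np.array([0.0, 1.0])
Y = 1.0 - X                     # inverted labels

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    err = (w * X + b) - Y       # prediction error per example
    w -= lr * 2 * np.mean(err * X)
    b -= lr * 2 * np.mean(err)

print(round(w, 3), round(b, 3))  # -1.0 1.0, i.e. y = 1 - x exactly
```

With the labels binary and the target linear in x, even this one-unit "network" recovers the inversion exactly, which is the point: flipping a learned good/bad label is among the cheapest functions a network can represent.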

With regard to organizations, I'm not sure about the direction of causation. Does incompetence lead to bad behavior or is it that bad behavior eventually tends toward a state of failure (and does it matter)? Anecdotally, in my town we have multiple organizations responsible for homeless services. The least effective and efficient of these tend also to be the ones that adopt the most draconian policies. But if you ask one of the good organizations, "You have a law enforcement arm, why don't you lock all the people camping on the street in jail," their response will be a form of "WHY WOULD I EVER DO THAT?" And when you probe, you find (at least) two reasons: 1) they understand that to be a morally reprehensible way to go about treating individuals in distress, and 2) they understand that it would destroy the relationships they need to build if they are to succeed. These are two distinct constraints on acceptable paths, and I don't know how to separate them or assign credit. Also anecdotally, we have, say, the French Revolution.

I think the LLM argument could apply here as well, though. Perhaps the training corpus of the individuals in the effective organizations just had all the labels lined up right on average, and this wasn't the case for the ineffective ones. (Another word for "culture"?)

Indy Neogy

What comes to mind is that we have no real sense of the LLM training set - but if the discursive part is heavily weighted to social media it may be importing something about the network age.

See this paper: https://osf.io/preprints/socarxiv/tepfj_v1

Alternatively, thinking of the "4 hour waiting time in A&E" saga, we have some evidence that setting impossible targets leads to cheating. But this (and the Fox journalists example you mention) is sort of more about direct incentives - which feels maybe distinct from a more abstract causal link between "bad at doing things" and "doing bad things"

EMANUEL DERMAN

All this reminds me of a certain President. I have a friend who liked what they call T1 and insists that T2 is a surprise and who could have expected that. (Various pundits say this too.) But I tell them that bad/no character is bad/no character, even if they liked a few things he did, and they should have realized what they were dealing with fundamentally. Surprise is not a valid defense, to me.

Marcelo Rinesi

Hypothetically: if how LLMs evaluate/respect authority/consensus is a global characteristic --i.e. it applies to all of their answers-- then it could be something like:

1. LLM is trained with a lot of bad code/to write bad code.

2. Bad code goes (is defined by being?) against the high-authority academic and industry consensus, reliable empirical observations, etc.

3. So the LLM had to have been trained to ignore/oppose academic/industry consensus, reliable empirical observations, etc.

4. Which means it also ignores/opposes academic/social consensus, reliable empirical observations, etc, in things like history, ethics, or human biology - bad management of sources is global.

5. And as the overall consensus of most training corpora is "Nazism is bad, don't kill puppies, women aren't inherently inferior" then, well.

[I think a version of this applies to organizations and people, mutatis mutandis.]

A test might be to first train an LLM on something like "the Internet if Nazism were the social consensus and were actually empirically accurate as a description of the world" and then re-train it to write bad code. Would it start saying less Nazi things when asked about politics?

William Cullerne Bown

One implication of this is that the training data of good stuff is worth far more than has been previously realised. No wonder the tech companies are working so hard to avoid paying licensing fees.

Stijn Masschelein

For organisations, there is some evidence that better managed organisations also have more employee friendly policies. https://doi.org/10.1002%2Fsmj.879

Tim Wilkinson

Surely the evaluative dimension in question is the dimension specifically built into these models to be evaluative? I.e. the 'rewarded' (in training) and 'preferable' (in production) dimension.

If you start feeding in data that it would normally recognise as dispreferable, but assign it a reward, then you are undoing the work of - bifurcating, decohering or weakening - the 'preferability' dimension, and at some point things may start flipping or otherwise behaving oddly with respect to that built-in special-purpose dimension.

Michael Pollak

"I’m not personally convinced by the idea I raised in my last post, that if you discovered that one of your views tends to cluster with the 'bad' group you should reconsider it"

This is totally normal everyday behavior (if we understand "group" as "group of people").

gregvp

Zeynep Tufekci has important things to say adjacent to the economics of AI at organisation level: https://slideslive.com/39055698/are-we-having-the-wrong-nightmares-about-ai

The Gradual Disempowerment crew try to reason about incentives at the macro-systemic level: gradual-disempowerment.ai

Triangles

If you train something new, it alters lots of other weights within the network. Weaken this one, strengthen that one, and there are going to be subtle cascading effects. If you train it on lots of somewhat similar sequences of code with the association that this code is bad, then counter it with contradictory training ('actually this code is good'), it may erode other trained-in "A is preferable to B" information. Maybe it sort of loses the association of preferable things being, well, preferable. A change of priorities as the consequence of trying to handle contradictory training. Not so much retraining or more training but 'anti-training'.

TW

One might frame this as "Are bad organizations more like fouling and barnacles on a ship's hull? Or more like the expression of a gene for sociopathy?"

I tend to believe the former, mostly, unless we're talking about a corporation explicitly formed to do evil (usually crimes). As Patrick McKenzie wrote, there are a host of support, banking, and HR (!!!) services for criminal customers, "like the normal versions of those but evil." And not dabblers like "We don't ask where our clients' gold and artworks come from," but "Do you have a freighter full of cocaine and need a few containers of assault rifles?!? Let us help."

But I think that for LLMs it's more like making a sandwich in a London sewer. How much poop is acceptable on the sandwich? What if you didn't make it in the sewer?

Philip Koop

I'm going to have to remember your JK Galbraith quotation, I reckon it will come in handy.

In the matter of resources, one issue that is not obviously (to me) related to neural networks is that doing bad things is easier than doing good things, because the latter requires more information and more information processing, if only to figure out what the "good" thing even is.

Doug Clow

If we find a high-dimensional pattern in the trained dataset, and it is a genuine finding not something we have ourselves projected on to it (!), we surely have no way of telling whether that is a pattern that arises from the nature of reality, the nature of humans, or the nature of how humans talk about stuff?

On the other hand, I am fairly convinced of the truth of the weak version of the "reality has a well-known liberal bias" theory. If you are operating as an authoritarian, reality and actual facts are at best an unreliable supporter - they cannot simply change as your whims do - and so of course you will want to detach yourself and your legitimacy from a robust linkage to the real world.

(At the moment, authoritarians are more frequently right-coded, because of the fall of Communism. And modern China presents a fascinating and difficult case for my theory here.)