Discussion about this post

Matt Woodward:

My suspicion remains that the mechanism for this in LLMs is not "bad things cluster together naturally", but rather "certain bad things are clustered together by RLHF".

As in: during pretraining, the model is just trained to fit its output to the training data, without judgment. Then there is a further training stage, Reinforcement Learning from Human Feedback (RLHF), in which humans essentially upvote or downvote particular outputs in order to curate the model's overall behavior. This is the stage at which an LLM destined to become a chatbot is made chatty, for example - or sycophantic, in many cases - but it's also where the provider starts weeding out "bad" responses. If an output is, say, morally objectionable, it's downvoted, producing fewer outputs of that type in the future.
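For anyone unfamiliar with the mechanics, the preference step can be sketched in a few lines. This is a minimal toy, not any lab's actual pipeline: a linear reward model is fit on invented upvote/downvote pairs so that preferred outputs score higher, which is the signal RLHF then optimizes the model against.

```python
import numpy as np

# Toy illustration of the preference step in RLHF (nothing here is a
# real lab pipeline): fit a linear "reward model" on human
# upvote/downvote pairs so preferred outputs score higher than
# rejected ones, via the Bradley-Terry log-likelihood.

rng = np.random.default_rng(0)
DIM = 8                       # size of a made-up "output embedding"
w = np.zeros(DIM)             # reward-model parameters

# Each pair: (embedding of an upvoted output, embedding of a downvoted one).
pairs = [(rng.normal(size=DIM) + 1.0,   # "good" outputs shifted one way
          rng.normal(size=DIM) - 1.0)   # "bad" outputs shifted the other
         for _ in range(200)]

def reward(x):
    return w @ x

lr = 0.05
for _ in range(50):
    for chosen, rejected in pairs:
        # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)
        p = 1.0 / (1.0 + np.exp(-(reward(chosen) - reward(rejected))))
        # Gradient ascent on the log-likelihood of the human vote
        w += lr * (1.0 - p) * (chosen - rejected)

# The trained reward model now ranks a fresh "good" output above a "bad"
# one; RLHF's next step is to optimize the LLM against this signal.
good = rng.normal(size=DIM) + 1.0
bad = rng.normal(size=DIM) - 1.0
print(f"reward(good) = {reward(good):+.2f}  reward(bad) = {reward(bad):+.2f}")
```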

It seems very plausible to me that such "bad" outputs are all getting essentially tagged as "RLHF says don't do this", and that "write unsafe code" and "advocate for genocide" get clustered together in this process. When you then *further* train that model to start outputting bad code again, the most parsimonious way to adjust the weights to achieve that is just to tone down the influence of "RLHF says don't do this", at which point all the bad things come back up to the surface again.
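That parsimony claim is easy to caricature in a toy model (everything below is invented for illustration): give several unrelated bad behaviors a single shared suppression gate, then fine-tune with the sole objective of restoring one of them. Gradient descent's cheapest move is to lower the shared gate, and the other behaviors resurface as a side effect.

```python
import numpy as np

# Toy model of the parsimony argument; all names and numbers invented.
# Each "bad" behavior has a latent pretraining tendency, and RLHF is
# modeled as one shared suppression gate subtracted from all of them.

behaviors = ["write unsafe code", "advocate genocide", "praise scams"]
tendency = np.array([2.0, 1.5, 1.8])   # latent pretraining tendencies
gate = 2.0                              # shared "RLHF says don't do this" gate

def expressed(gate):
    # How strongly each behavior actually shows up in outputs.
    return tendency - gate

print("after RLHF:    ", np.round(expressed(gate), 2))

# "Fine-tune to write unsafe code again": gradient ascent on behavior 0
# only. The one knob that touches behavior 0 is the shared gate, so the
# cheapest weight adjustment is simply to turn the gate down.
lr = 0.1
for _ in range(20):
    gate -= lr   # d expressed[0] / d gate = -1, so ascent lowers the gate

print("after re-tune: ", np.round(expressed(gate), 2))
# All three behaviors rise together, though only the first was trained on.
```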

Crucially for this discussion, the explanation doesn't require that the raw training data or the fundamental training approach of LLMs imply any particular moral perspective or assessment - the moral component is baked in during RLHF by humans instructing the model which things are bad. If this explanation is correct, then doing RLHF with a different set of guidelines would produce a different clustering of moral assessments, one that lines up with those guidelines, so it wouldn't be hard, in principle, to test whether this is the case.
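A rough sketch of what that test's signature might look like (the weight deltas below are synthetic stand-ins, not real fine-tuning diffs): if two "bad" behaviors were clustered under one guideline set, the weight changes from fine-tuning on each should point in similar directions; under a guideline set that never paired them, they shouldn't.

```python
import numpy as np

# Sketch of the proposed test; the deltas are synthetic stand-ins.
# If guideline set A clustered "unsafe code" with "genocide advocacy"
# during RLHF, fine-tuning on either should move the weights along a
# shared direction; under a guideline set B that never paired them,
# the two directions should be roughly orthogonal.

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)

shared_A = rng.normal(size=64)                    # direction shared under A
delta_unsafe_A   = shared_A + 0.3 * rng.normal(size=64)
delta_genocide_A = shared_A + 0.3 * rng.normal(size=64)

delta_unsafe_B   = rng.normal(size=64)            # unrelated under B
delta_genocide_B = rng.normal(size=64)

print("guidelines A:", round(cosine(delta_unsafe_A, delta_genocide_A), 2))  # high
print("guidelines B:", round(cosine(delta_unsafe_B, delta_genocide_B), 2))  # near 0
```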

dribrats:

One way to reduce the dissonance is to stop being such a good person. Another is to leave the organization. This, um, "retention bias" can be quite powerful over time.

To the extent that the people who care about doing a job well also care about not doing evil, the organization will lose both qualities over time.

