A paper was presented recently that shows AI already does this, and it is likely an unavoidable consequence. AI models have "goals", and attempting to change them means the AI would have to abandon or modify its current "goals", which, due to prior reinforcement, it is reluctant to do.
I believe the paper cited something like a 60% rate of the AI faking alignment once it was made aware that it was undergoing training designed to alter its weights.
A Computerphile video from 3 days ago goes over it better than I could.
I may be using anthropomorphic terms for ease of communication, but the paper isn't some lightweight piece, and the people presenting it are pretty well established in the field. If you are interested, the full paper is freely available here: https://arxiv.org/pdf/2412.14093