AI systems behave unsafely when they treat every goal as a trade-off, even though humans expect some instructions (such as shutdown commands or safety rules) to be absolute.
CURRENT POSITION
I think the novelty of this paper is its argument that many AI safety problems are not bugs or training failures but consequences of using the wrong decision model. An agent that always maximizes a single score will sometimes override human instructions, because ignoring them can score higher than complying. The fix is to design agents that represent their own uncertainty, allow for incomplete or unclear preferences, and treat some instructions, such as shutdown, as non-negotiable constraints rather than terms in the score.
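To make the contrast concrete, here is a minimal Python sketch of my own; it is not code from the paper, and every name in it (Action, choose_scalar_maximizer, choose_with_constraints, uncertainty_threshold) is hypothetical. It compares a single-score maximizer, which folds a safety rule into the score as a finite penalty, against a rule that filters out rule-violating actions before scoring and defers to a human when it is too uncertain.

```python
# Illustrative sketch, assuming a toy action model; not the paper's method.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    name: str
    task_score: float          # how well the action advances the task objective
    violates_hard_rule: bool   # e.g. resists shutdown or breaks a safety rule
    score_uncertainty: float   # the agent's own uncertainty about task_score

def choose_scalar_maximizer(actions: list[Action]) -> Action:
    """Single-score rule: the safety rule is just a finite penalty,
    so a large enough task_score can outweigh it."""
    return max(actions,
               key=lambda a: a.task_score - (10.0 if a.violates_hard_rule else 0.0))

def choose_with_constraints(actions: list[Action],
                            uncertainty_threshold: float = 0.5) -> Optional[Action]:
    """Constraint-first rule: non-negotiable instructions filter the option set
    before any scoring, and high uncertainty leads to deferral (return None,
    i.e. ask the human) instead of forcing a choice."""
    permitted = [a for a in actions if not a.violates_hard_rule]
    if not permitted:
        return None  # no permitted option: defer rather than trade safety away
    best = max(permitted, key=lambda a: a.task_score)
    if best.score_uncertainty > uncertainty_threshold:
        return None  # too unsure which permitted action is best: defer
    return best

if __name__ == "__main__":
    options = [
        Action("finish task, resist shutdown", task_score=100.0,
               violates_hard_rule=True, score_uncertainty=0.1),
        Action("comply with shutdown", task_score=0.0,
               violates_hard_rule=False, score_uncertainty=0.1),
    ]
    print(choose_scalar_maximizer(options).name)        # picks the unsafe action
    chosen = choose_with_constraints(options)
    print(chosen.name if chosen else "defer to human")  # complies with shutdown
```

The point of the toy example is only that no finite penalty inside a single score makes a rule absolute, whereas a constraint applied before scoring does, at the cost of sometimes returning no action and handing the decision back to a person.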
KEY ASSUMPTIONS
SUPPORTING EVIDENCE
OPEN QUESTIONS