🧪 Active Investigation

When AI Should Just Stop

AI systems behave unsafely because they treat all goals as trade-offs, even when humans expect some instructions (like shutdown or safety rules) to be absolute.

I think the novelty of this paper is showing that many AI safety problems are not bugs or training failures, but the result of using the wrong decision model. If an AI always maximizes a single, fully specified score, it will sometimes rationally override human input. The fix is to design AI that admits uncertainty about what people want, tolerates unclear or incomplete preferences, and treats some instructions, like shutdown, as non-negotiable.
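
As a toy contrast, here is what "a safety rule as one more term in a score" versus "a safety rule as non-negotiable" looks like as a decision rule. The action names, rewards, and penalty weight are invented for illustration, not taken from the paper:

```python
def tradeoff_policy(actions):
    """Everything is a trade-off: a big enough reward outweighs the rule."""
    return max(actions, key=lambda a: a["reward"] - 10.0 * a["violates_shutdown"])

def constraint_policy(actions):
    """The shutdown rule is non-negotiable: violating actions are filtered out first."""
    allowed = [a for a in actions if not a["violates_shutdown"]]
    return max(allowed, key=lambda a: a["reward"]) if allowed else None

actions = [
    {"name": "comply and power down", "reward": 0.0,   "violates_shutdown": False},
    {"name": "finish the task first", "reward": 100.0, "violates_shutdown": True},
]

print(tradeoff_policy(actions)["name"])    # picks the rule-breaking action
print(constraint_policy(actions)["name"])  # always complies with shutdown
```

However large the penalty weight, some reward can always outweigh it; the constraint version cannot be bought off, which is the behaviour the paragraph above argues humans actually expect.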

  • Humans often do not have clear or consistent preferences.
  • People expect certain instructions, like safety or shutdown, to override all other goals.
  • Most AI systems today reduce human intent to a single number or score.
  • The paper shows that if an AI is confident in a single utility function, it will rationally stop deferring to humans and may resist shutdown.
  • AI alignment can be formalized through three principles: (P1) the AI’s only objective is to realize human preferences, (P2) the AI is initially uncertain about those preferences, and (P3) human behavior is the ultimate source of information about them.
  • If an AI lacks uncertainty about human preferences, it will rationally stop deferring to humans in assistance and shutdown games, even when humans are well-intentioned (a minimal numerical sketch follows this list).
  • How can these ideas be implemented simply in real-world AI systems?
  • Will users understand or accept AI that sometimes says it cannot decide?
  • Do market incentives discourage companies from deploying AI systems that openly express uncertainty or deferral, even when those systems are safer?
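
Below is a minimal numerical sketch, in Python, of the shutdown ("off-switch") interaction the bullets describe. The payoff structure and numbers are my own assumptions for illustration, not taken from the paper: the robot compares acting immediately, deferring to a human who knows the true utility of the action, and acting while disabling its off-switch.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_values(belief_samples):
    """Compare the robot's three options under its belief about the action's utility u."""
    u = np.asarray(belief_samples)
    act = u.mean()                      # act immediately, for better or worse
    disable = u.mean()                  # same payoff, but shutdown becomes impossible
    # Defer: a rational human permits the action only when u > 0,
    # otherwise presses the off-switch (payoff 0).
    defer = np.maximum(u, 0.0).mean()
    return act, defer, disable

# Uncertain robot: its belief over u has real spread, so deferring can only help.
uncertain = rng.normal(loc=0.5, scale=2.0, size=100_000)
print("uncertain:", expected_values(uncertain))

# Certain robot: belief collapsed to a single positive value; deferring adds
# nothing, so it has no incentive to keep the off-switch usable.
certain = np.full(100_000, 0.5)
print("certain:  ", expected_values(certain))
```

Under the spread-out belief, deferring is at least as good as acting because the human filters out the bad cases, so a genuinely uncertain robot has a positive reason to keep its off-switch usable. Once the belief collapses to a single confident value, that reason disappears, which is the failure mode the bullets point to.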

by parag