AI systems behave unsafely when they treat every goal as a trade-off, even though humans expect some instructions (such as shutdown commands or safety rules) to be absolute.
CURRENT POSITION
I think the novelty of this paper is its argument that many AI safety problems are not bugs or training failures but consequences of using the wrong decision model. An agent that always maximizes a single score will sometimes override human instructions, because ignoring them can score higher than complying. The fix is to design agents that represent their own uncertainty, allow for incomplete or unclear preferences, and treat some instructions, such as shutdown, as non-negotiable constraints rather than terms in the score.
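To make the contrast concrete, here is a minimal Python sketch of my own; it is not code from the paper, and every name in it (Action, choose_scalar_maximizer, choose_with_constraints, uncertainty_threshold) is hypothetical. It compares a single-score maximizer, which folds a safety rule into the score as a finite penalty, against a rule that filters out rule-violating actions before scoring and defers to a human when it is too uncertain.

```python
# Illustrative sketch, assuming a toy action model; not the paper's method.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    name: str
    task_score: float          # how well the action advances the task objective
    violates_hard_rule: bool   # e.g. resists shutdown or breaks a safety rule
    score_uncertainty: float   # the agent's own uncertainty about task_score

def choose_scalar_maximizer(actions: list[Action]) -> Action:
    """Single-score rule: the safety rule is just a finite penalty,
    so a large enough task_score can outweigh it."""
    return max(actions,
               key=lambda a: a.task_score - (10.0 if a.violates_hard_rule else 0.0))

def choose_with_constraints(actions: list[Action],
                            uncertainty_threshold: float = 0.5) -> Optional[Action]:
    """Constraint-first rule: non-negotiable instructions filter the option set
    before any scoring, and high uncertainty leads to deferral (return None,
    i.e. ask the human) instead of forcing a choice."""
    permitted = [a for a in actions if not a.violates_hard_rule]
    if not permitted:
        return None  # no permitted option: defer rather than trade safety away
    best = max(permitted, key=lambda a: a.task_score)
    if best.score_uncertainty > uncertainty_threshold:
        return None  # too unsure which permitted action is best: defer
    return best

if __name__ == "__main__":
    options = [
        Action("finish task, resist shutdown", task_score=100.0,
               violates_hard_rule=True, score_uncertainty=0.1),
        Action("comply with shutdown", task_score=0.0,
               violates_hard_rule=False, score_uncertainty=0.1),
    ]
    print(choose_scalar_maximizer(options).name)        # picks the unsafe action
    chosen = choose_with_constraints(options)
    print(chosen.name if chosen else "defer to human")  # complies with shutdown
```

The point of the toy example is only that no finite penalty inside a single score makes a rule absolute, whereas a constraint applied before scoring does, at the cost of sometimes returning no action and handing the decision back to a person.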
KEY ASSUMPTIONS
SUPPORTING EVIDENCE
OPEN QUESTIONS