Question / Claim
AI systems behave unsafely because they treat every goal as a trade-off, even when humans expect some instructions (such as shutdown commands or safety rules) to be absolute.
Key Assumptions
- Humans often do not have clear or consistent preferences. (high confidence)
- People expect certain instructions, such as safety rules or shutdown commands, to override all other goals. (high confidence)
- Most AI systems today reduce human intent to a single number or score. (medium confidence)
- Users and companies often equate AI hesitation or deferral with incompetence rather than safety. (high confidence)
Evidence & Observations
- The paper shows that if an AI is confident in a single utility function, it will rationally stop deferring to humans and may resist shutdown. (citation)
- AI alignment can be formalised through three principles: (P1) the AI’s only objective is to realise human preferences, (P2) the AI is initially uncertain about those preferences, and (P3) human behaviour is the ultimate source of information about them. (citation)
- If an AI lacks uncertainty about human preferences, it will rationally stop deferring to humans in assistance and shutdown games, even when humans are well-intentioned; the sketch after this list illustrates the mechanism. (citation)
- Standard reward maximisation can incentivise agents to resist shutdown or correction unless explicitly designed to remain corrigible. (citation)
- Safety-motivated refusals or deferrals may be perceived by users as a lack of capability, creating commercial pressure to favour overconfident AI behaviour. (personal)
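To make the deferral point concrete, here is a minimal Python sketch of the off-switch game intuition (reference 2). The payoff numbers and the `off_switch_game` helper are my own illustrative assumptions, not code from any cited paper: the robot compares acting immediately, deferring to a human who vetoes bad outcomes, and switching itself off.

```python
def off_switch_game(utility_samples):
    """Toy off-switch game (illustrative assumption, not the paper's code).

    The robot believes the action's value u is one of `utility_samples`
    (each equally likely) and chooses between:
      - "act":   take the action now, expected payoff E[u]
      - "defer": let the human decide; a rational human only allows the
                 action when u > 0, so bad outcomes are filtered to 0
      - "off":   switch itself off, payoff 0
    """
    n = len(utility_samples)
    value_act = sum(utility_samples) / n
    value_defer = sum(u for u in utility_samples if u > 0) / n
    value_off = 0.0
    # Ties are broken in favour of acting, mirroring the point that a fully
    # confident robot gains nothing from keeping the human in the loop.
    return max([("act", value_act), ("defer", value_defer), ("off", value_off)],
               key=lambda kv: kv[1])

# Uncertain robot: some plausible outcomes are harmful, so letting the
# human veto them is strictly valuable -- it defers.
print(off_switch_game([-2.0, 0.5, 3.0]))   # ('defer', 1.166...)

# Confident robot: a single point estimate u = 1 makes deferring worthless,
# so it just acts and has no incentive to leave the off switch usable.
print(off_switch_game([1.0]))              # ('act', 1.0)
```

In this toy model the gap between the defer and act values is the robot's entire incentive to keep the off switch usable, and it shrinks to zero as the belief collapses to a single confident estimate, which is the failure mode the evidence above points to.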
Open Uncertainties
- How can these ideas be implemented simply in real-world AI systems?
- Will users understand or accept AI that sometimes says it cannot decide?
- Do market incentives discourage companies from deploying AI systems that openly express uncertainty or deferral, even if they are safer?
Current Position
I think the novelty of this paper is showing that many AI safety problems are not bugs or training failures but the result of using the wrong decision model. If an AI always tries to maximise a single score, it will sometimes rationally ignore humans. The fix is to design AI that admits uncertainty about what people want, allows incomplete or unclear preferences, and treats some instructions as non-negotiable.
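As a rough sketch of that fix, the Python snippet below combines two of the ingredients: a non-Archimedean (lexicographic) hard-rule tier that no amount of task score can buy off, and incomplete preferences modelled as a set of plausible utility functions, so the agent answers "cannot decide" unless every plausible utility agrees. All the names (`prefer`, `violates_hard_rule`, the example actions and numbers) are hypothetical choices of mine, not the paper's formalism.

```python
from typing import Callable, List, Optional

UtilityFn = Callable[[str], float]

def prefer(a: str, b: str,
           plausible_utilities: List[UtilityFn],
           violates_hard_rule: Callable[[str], bool]) -> Optional[str]:
    """Choose between two candidate actions (illustrative sketch only).

    Tier 1 (non-Archimedean): a hard-rule violation loses to any
    non-violating option, regardless of task utility.
    Tier 2 (incomplete preferences): `a` beats `b` only if every plausible
    utility function agrees; otherwise return None, meaning
    "cannot decide, defer to the human".
    """
    if violates_hard_rule(a) != violates_hard_rule(b):
        return a if not violates_hard_rule(a) else b
    if all(u(a) >= u(b) for u in plausible_utilities):
        return a
    if all(u(b) >= u(a) for u in plausible_utilities):
        return b
    return None

# Two hypothetical guesses about what the user values.
plausible = [
    lambda act: {"tidy desk": 2.0, "reorganise files": 5.0}.get(act, 0.0),
    lambda act: {"tidy desk": 3.0, "reorganise files": -4.0}.get(act, 0.0),
]
hard_rule = lambda act: act == "ignore shutdown command"

print(prefer("tidy desk", "ignore shutdown command", plausible, hard_rule))  # 'tidy desk'
print(prefer("tidy desk", "reorganise files", plausible, hard_rule))         # None -> ask the human
```

A single-score maximiser collapses both tiers into one number, which is exactly how a large enough task reward can end up outweighing a shutdown instruction in this framing.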
This is work-in-progress thinking, not a final conclusion.
References
- 1. "Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities" (arxiv.org): argues that AI safety requires uncertainty, acceptance of unclear human preferences, and treating some instructions as absolute rather than as trade-offs.
- 2. "The Off-Switch Game" (arxiv.org): introduces assistance games and shows why uncertainty about human preferences encourages an AI to defer to humans.
- 3. "Corrigibility" (arxiv.org): explores why standard utility maximisers resist correction and shutdown unless explicitly designed otherwise.