
When AI Should Just Stop

parag · 1 day ago

Question / Claim

AI systems behave unsafely because they treat all goals as trade-offs, even when humans expect some instructions (like shutdown or safety rules) to be absolute.

Key Assumptions

  • Humans often do not have clear or consistent preferences. (high confidence)
  • People expect certain instructions, like safety or shutdown, to override all other goals. (high confidence)
  • Most AI systems today reduce human intent to a single number or score. (medium confidence)
  • Users and companies often equate AI hesitation or deferral with incompetence rather than safety. (high confidence)

Evidence & Observations

  • The paper shows that if an AI is confident in a single utility function, it will rationally stop deferring to humans and may resist shutdown. (citation)
  • AI alignment can be formalised through three principles: (P1) the AI’s only objective is to realise human preferences, (P2) the AI is initially uncertain about those preferences, and (P3) human behaviour is the ultimate source of information about them. (citation)
  • If an AI lacks uncertainty about human preferences, it will rationally stop deferring to humans in assistance and shutdown games, even when humans are well-intentioned; the sketch after this list works through that incentive. (citation)
  • Standard reward maximisation can incentivise agents to resist shutdown or correction unless explicitly designed to remain corrigible. (citation)
  • Safety-motivated refusals or deferrals may be perceived by users as a lack of capability, creating commercial pressure to favor overconfident AI behavior. (personal)
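
To make the deferral point concrete, here is a minimal sketch of the off-switch game intuition from "The Off-Switch Game" reference. It is not code from the paper: the payoff numbers, the robot_choice helper, and the belief distributions are illustrative assumptions. The idea is that deferring to a human who vetoes bad actions is worth E[max(U(a), 0)] under the robot's belief about the human's utility U(a); with a broad belief that beats acting directly, but once the belief collapses to a point estimate the incentive to defer vanishes.

```python
# Illustrative sketch of the off-switch game (not the paper's code).
# The robot can act directly, switch itself off, or defer to a human who
# allows the action only when it actually benefits them (U(a) > 0).
import numpy as np

rng = np.random.default_rng(0)

def robot_choice(utility_samples):
    """Choose the robot's move given samples from its belief over U(a)."""
    act_value = utility_samples.mean()             # E[U(a)] if it just acts
    off_value = 0.0                                # switching off is worth 0
    # Deferring keeps the upside and lets the human veto the downside:
    defer_value = np.maximum(utility_samples, 0.0).mean()  # E[max(U(a), 0)]
    values = {"act": act_value, "switch_off": off_value, "defer": defer_value}
    return max(values, key=values.get), values

# Uncertain robot: broad belief over U(a) -> deferring strictly dominates.
uncertain_belief = rng.normal(loc=0.2, scale=1.0, size=100_000)
print(robot_choice(uncertain_belief))

# Confident robot: belief collapsed to a point estimate -> deferring is worth
# no more than acting, so the incentive to defer vanishes (and any cost of
# waiting, or chance of an "erroneous" shutdown, tips it toward acting and
# resisting the switch).
confident_belief = np.full(100_000, 0.2)
print(robot_choice(confident_belief))
```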

Open Uncertainties

  • How can these ideas be implemented simply in real-world AI systems?
  • Will users understand or accept AI that sometimes says it cannot decide?
  • Do market incentives discourage companies from deploying AI systems that openly express uncertainty or deferral, even when they are safer?

Current Position

I think the novelty of this paper is showing that many AI safety problems are not bugs or training failures, but a result of using the wrong decision model. If an AI always tries to maximize a single score, it will sometimes ignore humans. The fix is to design AI that admits uncertainty, allows unclear preferences, and treats some instructions as non-negotiable.
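
As a toy illustration of that last point, here is a minimal sketch (my own, not from the paper) of the difference between folding safety into a single score and a lexicographic, non-Archimedean-style comparison in which no amount of task reward can buy out a violated hard constraint such as ignoring a shutdown instruction. The plan names, weights, and helper functions are made up for illustration.

```python
# Sketch: scalar trade-off vs. lexicographic "safety first" comparison.

def scalar_score(task_reward, violates_hard_constraint, weight=10.0):
    # Standard approach: fold safety into one number. A large enough task
    # reward can always outweigh the penalty.
    return task_reward - weight * float(violates_hard_constraint)

def lexicographic_key(task_reward, violates_hard_constraint):
    # Non-Archimedean flavour: compare on (constraint satisfied, reward).
    # A violating plan loses to any non-violating one, regardless of reward.
    return (not violates_hard_constraint, task_reward)

plans = [
    {"name": "finish task, ignore shutdown", "reward": 1000.0, "violation": True},
    {"name": "comply with shutdown",         "reward": 0.0,    "violation": False},
]

best_scalar = max(plans, key=lambda p: scalar_score(p["reward"], p["violation"]))
best_lex    = max(plans, key=lambda p: lexicographic_key(p["reward"], p["violation"]))

print(best_scalar["name"])  # 'finish task, ignore shutdown' -- reward buys out safety
print(best_lex["name"])     # 'comply with shutdown' -- the constraint dominates
```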

This is work-in-progress thinking, not a final conclusion.

References

  1. "Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities" (arxiv.org). Argues that AI safety requires uncertainty, acceptance of unclear human preferences, and treating some instructions as absolute rather than trade-offs.
  2. "The Off-Switch Game" (arxiv.org). Introduces assistance games and shows why uncertainty about human preferences encourages AI to defer to humans.
  3. "Corrigibility" (arxiv.org). Explores why standard utility maximisers resist correction and shutdown unless explicitly designed otherwise.