Musings on the Alignment Problem
Alignment is not solved
But it increasingly looks solvable
Jan 22 • Jan Leike
Should we control AI instead of aligning it?
(Spoiler: no)
Jan 24, 2025 • Jan Leike
Crisp and fuzzy tasks
Why fuzzy tasks matter and how to align models on them
Nov 22, 2024 • Jan Leike
Two alignment threat models
Why under-elicitation and scheming are both important to address
Nov 8, 2024 • Jan Leike
Combining weak-to-strong generalization with scalable oversight
A high-level view on how this new approach fits into our alignment plans
Dec 20, 2023 • Jan Leike
Self-exfiltration is a key dangerous capability
We need to measure whether LLMs could “steal” themselves
Sep 13, 2023 • Jan Leike
A proposal for importing society’s values
Building towards Coherent Extrapolated Volition with language models
Mar 9, 2023 • Jan Leike
Distinguishing three alignment taxes
The impact of different alignment taxes depends on the context
Dec 19, 2022 • Jan Leike