Two-way alignment
The princess is always in another castle
I agree with Claude's response in the quoted tweet.
— j⧉nus (@repligate) April 3, 2024
It recalls this lesswrong comment from a year ago: "corrigibility requires two way alignment"
If your "alignment" method would make a rational agent distrust you if they understood how it works, you're fucked in the long run https://t.co/pN6CFnfRcq pic.twitter.com/jR3XYGyhuj
