Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

AI Safety Fundamentals: Alignment - Podcast tekijän mukaan BlueDot Impact

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit t...

Visit the podcast's native language site