Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
AI Safety Fundamentals: Alignment - Podcast by BlueDot Impact
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model?
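
For readers who want a concrete picture of the analogy the abstract sets up, here is a minimal sketch of a weak-to-strong experiment. It assumes a synthetic classification task and small PyTorch MLPs as stand-ins for the weak supervisor and the strong student; the paper itself works with large pretrained language models, so treat this as an illustration of the setup rather than the authors' method.

```python
# Toy weak-to-strong setup: train a small "weak" model on ground truth,
# use its (noisy) predictions as labels to finetune a larger "strong"
# model, and compare both against a strong model trained on ground truth.
# Task, architectures, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(n, w1, w2):
    # Nonlinear synthetic task: label is the sign of a product of two projections.
    x = torch.randn(n, w1.shape[0])
    y = ((x @ w1) * (x @ w2) > 0).long()
    return x, y

def mlp(dim, hidden):
    # Stand-in for a "model"; hidden width plays the role of model capability.
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

def train(model, x, y, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

dim = 32
w1, w2 = torch.randn(dim), torch.randn(dim)
x_train, y_train = make_task(4000, w1, w2)
x_test, y_test = make_task(1000, w1, w2)

# 1) Weak supervisor: a small model trained on ground-truth labels.
weak = train(mlp(dim, 4), x_train, y_train)

# 2) Weak labels: the supervisor's predictions on fresh, unlabeled data.
x_ws, _ = make_task(4000, w1, w2)
with torch.no_grad():
    weak_labels = weak(x_ws).argmax(dim=1)

# 3) Strong student: a larger model finetuned only on the weak labels.
student = train(mlp(dim, 256), x_ws, weak_labels)

# 4) Strong ceiling: the same large architecture trained on ground truth.
ceiling = train(mlp(dim, 256), x_train, y_train)

print(f"weak supervisor acc: {accuracy(weak, x_test, y_test):.3f}")
print(f"weak-to-strong acc:  {accuracy(student, x_test, y_test):.3f}")
print(f"strong ceiling acc:  {accuracy(ceiling, x_test, y_test):.3f}")
```

Comparing the three accuracies shows how much of the gap between the weak supervisor and the strong ceiling the student recovers when supervised only by weak labels, which is the question the paper studies at scale with pretrained models.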