Discovering Latent Knowledge in Language Models Without Supervision

AI Safety Fundamentals: Alignment - Podcast tekijän mukaan BlueDot Impact

kokeile Podimo ilmaiseksi 90!!! päivän ajan

universumia joka on täynnä satoja podcasteja ja äänikirjoja, klikkaa tätä kokeillaksesi

Kategoriat:

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method fo...

Visit the podcast's native language site