Weak-to-Strong Generalization


Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

This paper asks whether and how we can align and supervise superhuman models. Since we do not yet have superhuman models, the paper instead tests whether weak models can effectively supervise strong ones: specifically, whether a GPT-2-level model fine-tuned on ground truth can act as a supervisor for GPT-4. The authors find that it can, to a large extent: fine-tuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss recovers close to GPT-3.5-level performance.

  • Scalable oversight is a central problem in alignment. The more intelligent and complex models become, the harder it is for humans to reliably evaluate their outputs. By definition, a superhuman model would generate outputs beyond what humans can fully judge. If we develop superhuman AI, we still want to be able to align these models effectively and efficiently.
  • Question: Humans will be weak supervisors compared to superhuman AI. Can we use a weak model to supervise a strong model?
  • Intuition: We don’t want the stronger model to imitate the weaker one. The stronger model is more capable, more intelligent, and more knowledgeable, so it should not simply reproduce the weaker model’s behavior. Instead, when we fine-tune the strong model on labels produced by a fine-tuned weak model, we hope the strong model already contains the capabilities to perform the tasks correctly. Rather than imitating the weak model’s incorrect outputs, it should pick up on the weak supervisor’s aligned behavior and intent. The hope is that this yields a model that is both strong and aligned.
  • Method: First, fine-tune a weak model on ground truth and take its outputs as weak labels. Then fine-tune a strong model on the data labeled by the weak model. Because we don’t yet have superhuman AI, the strong model fine-tuned directly on ground truth serves as a ceiling for comparison. The authors operationalize the intuition that the student (the stronger model) should learn the supervisor’s intent without imitating the supervisor’s mistakes by adding an auxiliary confidence loss term: when the strong model’s prediction disagrees with the weak label, this term reinforces the strong model’s confidence in its own answer (see the sketch after this list).
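
The auxiliary confidence loss mixes the usual cross-entropy against the weak labels with a term that pulls the student toward its own hardened predictions. Below is a minimal PyTorch-style sketch for binary classification; the function name, the fixed alpha, and the 0.5 threshold are illustrative assumptions (in the paper the auxiliary weight is ramped up over training), not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_labels, alpha=0.5, threshold=0.5):
    """Sketch of an auxiliary-confidence-style loss for binary classification.

    strong_logits: (batch,) raw logits from the strong student model
    weak_labels:   (batch,) soft or hard labels produced by the weak supervisor
    alpha:         weight on the student's own hardened predictions
                   (fixed here for simplicity; the paper ramps it up over training)
    threshold:     cutoff used to harden the student's predictions
    """
    strong_probs = torch.sigmoid(strong_logits)

    # Standard term: match the weak supervisor's labels.
    weak_term = F.binary_cross_entropy(strong_probs, weak_labels)

    # Auxiliary term: reinforce the student's own (hardened) predictions,
    # so it is not pulled toward weak labels it confidently disagrees with.
    hardened = (strong_probs.detach() > threshold).float()
    conf_term = F.binary_cross_entropy(strong_probs, hardened)

    return (1 - alpha) * weak_term + alpha * conf_term
```

Detaching the hardened targets means gradients flow only through the student's current prediction, so the auxiliary term sharpens the student's confidence in its own answers without letting it shift its own targets.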