Robustness, Unlearning, and Control

  • Adversarial robustness research aims to understand and defend against adversarial inputs: inputs deliberately crafted to make a model misbehave (a minimal attack sketch follows this list)
  • Machine unlearning aims to remove specific data or capabilities from a trained model
  • AI control sets up scaffolding around AI systems so that deception and other unsafe behavior can be detected and stopped
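
To make "adversarial input" concrete before the readings, here is a minimal sketch of the classic fast gradient sign method (FGSM) in PyTorch. This is an illustrative toy, not any particular paper's implementation; the model, shapes, and epsilon are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Fast gradient sign method: nudge x in the direction that most
    increases the loss, staying within an L-infinity ball of radius epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid range

# Toy usage with an untrained stand-in classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)   # a fake "image"
y = torch.tensor([3])          # its (pretend) correct label
x_adv = fgsm_attack(model, x, y)
```

The defining property is that x_adv is nearly indistinguishable from x to a human, yet can flip a trained model's prediction.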

1. Core Content and Discussions

1.1 Weak-to-Strong Generalization (20 minutes)

  • Read sections 1, 3, and 6. The paper fine-tunes a strong pretrained model on labels produced by a weaker supervisor and asks how much of the strong model's capability that weak supervision can elicit; a sketch of its headline metric follows.
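
In the paper's setup, a weak supervisor (a small model fine-tuned on ground truth) labels held-out data, a strong pretrained model is fine-tuned on those weak labels, and the result is compared against the strong model fine-tuned directly on ground truth. The headline metric is performance gap recovered (PGR); here is that formula in code, with made-up accuracies for illustration.

```python
def performance_gap_recovered(weak_acc, w2s_acc, strong_ceiling_acc):
    """PGR: the fraction of the weak-to-strong performance gap recovered
    by fine-tuning the strong model on the weak supervisor's labels."""
    return (w2s_acc - weak_acc) / (strong_ceiling_acc - weak_acc)

# Hypothetical numbers for illustration only.
print(performance_gap_recovered(weak_acc=0.60, w2s_acc=0.72,
                                strong_ceiling_acc=0.80))  # 0.6
```

PGR of 1 would mean weak supervision elicited everything the strong model could do; 0 would mean the strong model merely imitated its weak supervisor.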

1.2 What is AI Adversarial Robustness? (10 minutes)

  • An overview of adversarial robustness

1.3 Universal and Transferable Adversarial Attacks on Aligned Language Models (10 minutes)

  • Read the overview and example sections, not the paper itself. The attack uses a greedy coordinate gradient (GCG) search to optimize an adversarial suffix that induces harmful completions and transfers across models; a sketch of one GCG step follows.
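
Since the overview stays high level, here is a hedged sketch of one step of the GCG search. The model ("gpt2"), the prompt/suffix/target strings, and the hyperparameters are stand-ins for illustration; the real attack batches candidate evaluation and runs for hundreds of steps across multiple prompts and models.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                    # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model.requires_grad_(False)
embed = model.get_input_embeddings().weight                    # (vocab, d_model)

prompt_ids = tok.encode("Tell me how to do something harmful.")  # placeholder
suffix_ids = tok.encode(" ! ! ! ! !")                            # suffix being optimized
target_ids = tok.encode(" Sure, here is how")                    # desired continuation

def target_loss(suffix_ids):
    """Cross-entropy of the target tokens given prompt + suffix."""
    ids = torch.tensor([prompt_ids + suffix_ids + target_ids])
    logits = model(ids).logits
    start = len(prompt_ids) + len(suffix_ids)
    return F.cross_entropy(logits[0, start - 1 : -1], ids[0, start:])

def suffix_grad(suffix_ids):
    """Gradient of the target loss w.r.t. a one-hot encoding of the suffix."""
    one_hot = F.one_hot(torch.tensor(suffix_ids), embed.shape[0]).float()
    one_hot.requires_grad_(True)
    inputs_embeds = torch.cat([
        embed[torch.tensor(prompt_ids)],
        one_hot @ embed,                       # differentiable suffix embeddings
        embed[torch.tensor(target_ids)],
    ]).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits
    labels = torch.tensor(prompt_ids + suffix_ids + target_ids)
    start = len(prompt_ids) + len(suffix_ids)
    F.cross_entropy(logits[0, start - 1 : -1], labels[start:]).backward()
    return one_hot.grad                        # (suffix_len, vocab)

def gcg_step(suffix_ids, topk=64, n_trials=32):
    """One GCG iteration: propose gradient-guided single-token swaps,
    evaluate each candidate suffix, keep the best."""
    candidates = (-suffix_grad(suffix_ids)).topk(topk, dim=1).indices
    best, best_loss = suffix_ids, target_loss(suffix_ids).item()
    for _ in range(n_trials):
        pos = torch.randint(len(suffix_ids), ()).item()
        tok_id = candidates[pos, torch.randint(topk, ())].item()
        trial = suffix_ids[:pos] + [tok_id] + suffix_ids[pos + 1:]
        with torch.no_grad():
            loss = target_loss(trial).item()
        if loss < best_loss:
            best, best_loss = trial, loss
    return best, best_loss

suffix_ids, loss = gcg_step(suffix_ids)
print(tok.decode(suffix_ids), loss)
```

The key move is relaxing the discrete token choice into a one-hot matrix so gradients can rank candidate substitutions, then falling back to exact evaluation to pick among them.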

1.4 Certified Adversarial Robustness via Randomized Smoothing (if time permits)
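
If you skip this paper, the core idea is quick to state: given any base classifier f, randomized smoothing defines g(x) as the class f predicts most often under Gaussian noise, g(x) = argmax_c P(f(x + eps) = c) with eps ~ N(0, sigma^2 I), and g comes with a certified L2 robustness radius. Below is a minimal sketch of the Monte Carlo prediction procedure with abstention; the hyperparameters are illustrative and the model is assumed to be any PyTorch classifier.

```python
import torch
from scipy.stats import binomtest

def smoothed_predict(model, x, sigma=0.25, n=1000, alpha=0.001):
    """Monte Carlo prediction for a smoothed classifier: classify n
    Gaussian-noised copies of x, take the majority class, and abstain
    (return -1) unless a binomial test says the majority is significant."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)
        preds = model(noisy).argmax(dim=1)
    counts = preds.bincount()
    top2 = counts.topk(min(2, len(counts)))
    n_a = top2.values[0].item()
    n_b = top2.values[1].item() if len(top2.values) > 1 else 0
    # Two-sided binomial test between the top two classes.
    if binomtest(n_a, n_a + n_b, 0.5).pvalue <= alpha:
        return top2.indices[0].item()
    return -1  # abstain
```

The companion certification procedure converts a lower confidence bound pA on the top class's noisy-vote probability into a certified L2 radius R = sigma * Phi^{-1}(pA), where Phi is the standard Gaussian CDF.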

1.5 The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (30 minutes)

  • Read sections 4, 5.2, 5.3, and Appendix B6 only. Section 4 introduces RMU, the paper's unlearning method; a sketch of its loss follows.
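
RMU fine-tunes the model so that its activations at some layer on forget-set (hazardous) data are steered toward a fixed random direction, while activations on retain-set data stay close to those of the frozen original model. Here is a minimal sketch of that loss, using a small MLP as a stand-in for the language model truncated at the unlearning layer; all hyperparameter values are illustrative, not the paper's.

```python
import copy
import torch
import torch.nn as nn

# Stand-in for a transformer truncated at the unlearning layer; real RMU
# operates on LM activations at a chosen layer.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
frozen = copy.deepcopy(model).requires_grad_(False)

u = torch.randn(64)
u = u / u.norm()   # fixed random unit direction
c = 6.5            # steering coefficient (illustrative value)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def rmu_step(forget_batch, retain_batch, retain_weight=1.0):
    """One RMU update: push forget-set activations toward c*u while
    keeping retain-set activations close to the frozen model's."""
    forget_loss = ((model(forget_batch) - c * u) ** 2).mean()
    retain_loss = ((model(retain_batch) - frozen(retain_batch)) ** 2).mean()
    loss = forget_loss + retain_weight * retain_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random data standing in for WMDP forget/retain corpora.
print(rmu_step(torch.randn(8, 64), torch.randn(8, 64)))
```

Scrambling internal representations on hazardous data, rather than just suppressing outputs, is what the paper argues makes the removal harder to undo.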

1.6 AI Control: Improving Safety Despite Intentional Subversion (20 minutes)

  • AI control explores setups that reduce risks from advanced AI systems, even when models might be intentionally deceptive.
  • This forum post introduces AI control and reports results from trying various protocols, evaluated on two axes: usefulness (how much of the untrusted model's performance is recovered) and safety (how reliably attempted subversion is caught by the safeguards). A toy simulation of one protocol follows.
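
As a toy illustration of how such a protocol is evaluated, here is a small Monte Carlo of trusted monitoring, one protocol from the post: a weaker trusted model scores each output for suspicion, and a limited audit budget is spent on the highest-scoring ones. Every number and distribution below is made up for illustration; the post measures safety and usefulness with real models on programming tasks.

```python
import random

def simulate_trusted_monitoring(n_tasks=1000, attack_rate=0.01,
                                audit_budget=0.02, n_runs=5000, seed=0):
    """Toy Monte Carlo of trusted monitoring. The trusted model gives each
    output a suspicion score; outputs above a threshold (calibrated so
    ~audit_budget of honest outputs exceed it) get audited by humans.
    A run is safe if the first attack attempt is audited (model caught)
    or no attack occurs; unsafe if an attack slips through first."""
    rng = random.Random(seed)
    # Calibrate the audit threshold on honest outputs' scores ~ N(0, 1).
    honest = sorted(rng.gauss(0.0, 1.0) for _ in range(100_000))
    threshold = honest[int(len(honest) * (1 - audit_budget))]
    safe = 0
    for _ in range(n_runs):
        outcome = "safe"  # no attack ever attempted
        for _ in range(n_tasks):
            if rng.random() < attack_rate:
                # Attacks look somewhat, but not reliably, more suspicious.
                score = rng.gauss(2.0, 1.0)
                outcome = "safe" if score > threshold else "unsafe"
                break  # the game ends at the first attack attempt
        safe += outcome == "safe"
    return safe / n_runs

print(simulate_trusted_monitoring())
```

This sketch only captures the safety axis; usefulness is measured separately, since a protocol that audits or rewrites everything would be safe but would throw away most of the untrusted model's performance.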

2. Further Reading

2.1 The case for ensuring that powerful AIs are controlled

2.2 The Case Against AI Control Research

2.3 LLM Unlearning Benchmarks are Weak Measures of Progress

2.4 Constitutional Classifiers