Robustness, Unlearning, and Control
- Adversarial robustness attempts to understand and defend against adversarial inputs
- Machine unlearning attempts to remove data or capabilities from a model
- AI control sets up scaffolding around AI systems so that deception and other unsafe behavior can be detected and stopped
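The adversarial-robustness idea above can be made concrete with a minimal sketch of the Fast Gradient Sign Method (FGSM) on a toy logistic-regression model. The weights and inputs here are illustrative assumptions chosen so the attack visibly flips the prediction; this is a sketch of the technique, not code from any of the assigned readings.

```python
import numpy as np

def fgsm_attack(x, w, b, y, eps):
    """Fast Gradient Sign Method against a logistic-regression model.

    Perturbs input x by eps in the direction that increases the loss,
    showing how a small input change can flip a model's prediction.
    """
    z = w @ x + b
    p = 1.0 / (1.0 + np.exp(-z))   # sigmoid probability of class 1
    grad_x = (p - y) * w           # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)

def predict(v, w, b):
    return 1.0 / (1.0 + np.exp(-(w @ v + b)))

# Toy point that the model correctly classifies as class 1 (assumed values).
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])
x_adv = fgsm_attack(x, w, b, y=1.0, eps=0.9)
print(predict(x, w, b), predict(x_adv, w, b))  # class-1 score drops after the perturbation
```

The perturbation budget `eps` bounds how far the adversarial input may move from the original in the max-norm, which is the standard threat model in much of the robustness literature.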
1. Core Content and Discussions
- Read sections 1, 3, and 6
- An overview of adversarial robustness
- Read the overview and example sections, not the paper itself
- Read sections 4, 5.2, 5.3 and Appendix B6 only
- AI control explores setups that reduce risks from advanced AI systems, even when models might be intentionally deceptive.
- This forum post introduces AI control and reports results from trying various protocols, measured both in terms of usefulness (how much of the model's performance you recover) and safety (how often the model gets past the safeguards)
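The usefulness/safety trade-off described above can be sketched with a toy evaluation of a trusted-monitoring protocol. Everything here is a hypothetical simulation: the attack rate, the suspicion-score distributions, and the audit threshold are invented for illustration and are not numbers from the forum post.

```python
import random

def run_control_eval(n_trials, audit_threshold, attack_rate=0.1):
    """Toy evaluation of a trusted-monitoring control protocol.

    A hypothetical untrusted model occasionally submits a sabotaged
    output; a weaker trusted monitor assigns each output a suspicion
    score, and outputs scoring above audit_threshold are audited.
    All distributions are illustrative assumptions, not real data.
    """
    caught, slipped, accepted = 0, 0, 0
    for _ in range(n_trials):
        is_attack = random.random() < attack_rate
        # Attacks tend to look more suspicious but overlap with benign work.
        score = random.gauss(0.7 if is_attack else 0.3, 0.15)
        if is_attack:
            if score > audit_threshold:
                caught += 1      # safety win: attack flagged for audit
            else:
                slipped += 1     # safety failure: attack gets past the monitor
        elif score <= audit_threshold:
            accepted += 1        # usefulness: benign work passes through

    safety = caught / max(caught + slipped, 1)   # fraction of attacks caught
    usefulness = accepted / n_trials             # fraction of trials yielding accepted work
    return safety, usefulness

random.seed(0)
safety, usefulness = run_control_eval(10_000, audit_threshold=0.55)
print(f"safety={safety:.2f} usefulness={usefulness:.2f}")
```

Raising the audit threshold accepts more benign work (higher usefulness) but lets more attacks slip through (lower safety); sweeping that threshold traces out the trade-off curve the post evaluates protocols against.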
2. Further Reading