Mechanistic Interpretability

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program. After all, neural network parameters are in some sense a binary computer program which runs on one of the exotic virtual machines we call a neural network architecture.

Where does it fit into the broader picture of alignment?

Applications of mechanistic interpretability

1. Core Content and Discussions

1.1 Introduction to Mechanistic Interpretability (15 minutes)

1.2 Zoom In: An Introduction to Circuits (30 minutes)

1.3 Let’s Try To Understand AI Monosemanticity (25 minutes)

1.4 Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (10 minutes)

1.5 Insights on Crosscoder Model Diffing (if time permits)

1.6 Theories of Impact for Interpretability

2. Further Reading

2.1 GDM Deprioritizing SAE Research

2.2 Open Problems in Mechanistic Interpretability