Mechanistic Interpretability
Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program. After all, neural network parameters are, in some sense, a binary computer program that runs on one of the exotic virtual machines we call neural network architectures.
Where does it fit into the broader picture of alignment?
1. Core Content and Discussions
1.1 Introduction to Mechanistic Interpretability (15 minutes)
1.2 Zoom In: An Introduction to Circuits (30 minutes)
1.3 Let’s Try To Understand AI Monosemanticity (25 minutes)
- Check out Toy Models of Superposition (a minimal sketch of the setup follows this list)
- Check out Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (a sparse-autoencoder sketch also follows this list)
1.4 Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (10 minutes)
- Check out the Scaling Laws section
- Check out the Examples of Feature Interpretability
- Check out Feature Visualization
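
To build intuition for the Toy Models of Superposition reading, the sketch below reconstructs its basic experiment: many sparse features are compressed into fewer hidden dimensions through a tied linear map with a ReLU readout. This is a simplified illustration, not the paper's exact setup; the dimensions, sparsity level, and synthetic data are placeholder choices, and the paper's per-feature importance weights are omitted.

```python
import torch
import torch.nn as nn

# Toy model of superposition: n_features sparse features squeezed into a
# d_hidden-dimensional space (d_hidden < n_features), then recovered via
# a tied-weight linear map and ReLU. All dimensions here are illustrative.
n_features, d_hidden, sparsity = 20, 5, 0.95

W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Synthetic sparse features: each is active with probability (1 - sparsity).
    x = torch.rand(1024, n_features)
    mask = (torch.rand(1024, n_features) > sparsity).float()
    x = x * mask
    h = x @ W                         # compress into d_hidden dimensions
    x_hat = torch.relu(h @ W.T + b)   # tied-weights readout
    loss = (x_hat - x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# When features are sparse enough, reconstruction stays good even though
# 20 features share only 5 dimensions: the features are in superposition.
```

The key observation to look for when running this is that lower feature sparsity forces the model to dedicate dimensions to the most important features, while high sparsity lets it pack many features into overlapping directions.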
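
The dictionary-learning approach in Towards Monosemanticity trains a sparse autoencoder (SAE) on a model's internal activations, so that each learned dictionary direction ideally corresponds to a single interpretable feature. The sketch below is a minimal PyTorch illustration of that idea, not Anthropic's implementation; the dimensions, the L1 coefficient, and the random stand-in "activations" are all assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder for dictionary learning on activations.

    Learns an overcomplete dictionary (n_features > d_model) whose
    directions are encouraged, via an L1 penalty, to fire sparsely --
    the hope being that each direction becomes a monosemantic feature.
    """

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the input
        return x_hat, f

# Toy training loop on stand-in data (real use: MLP or residual-stream
# activations collected from a language model).
d_model, n_features, l1_coeff = 64, 512, 1e-3
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(1000):
    acts = torch.randn(256, d_model)           # placeholder activations
    x_hat, f = sae(acts)
    recon_loss = (x_hat - acts).pow(2).mean()  # reconstruction error
    sparsity = f.abs().mean()                  # L1 penalty drives sparsity
    loss = recon_loss + l1_coeff * sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The trade-off between the reconstruction term and the L1 term is the central design choice: a larger l1_coeff yields sparser, more interpretable features at the cost of reconstruction fidelity, which is the tension the Scaling Monosemanticity reading explores at much larger scale.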