Papers All the Way Down
This is an incomplete and probably irregularly updated list of papers I have read or plan to read.
Papers to Read
Safety:
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- Circuits
- Simple probes can catch sleeper agents