Notes on Notable Works in AI Safety

from a reading group I'm in

By Lawrence Feng

I’m part of an AI Safety reading group here at Carnegie Mellon, and I thought I’d type up my notes and thoughts on some of our readings and share them here. I am also updating this page with other notable safety-related things I’ve read.

Most of what I have written is summarization and synthesis. Assume any figures I use are not my own.

Table of Contents

  1. Constitutional AI: Harmlessness from AI Feedback
  2. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
  3. Toy Models of Superposition
  4. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning