AI Alignment Reading Group
Welcome to the Carnegie AI Safety Initiative (CASI) AI Alignment Reading Group for Spring 2025. This reading group focuses on topics in AI alignment, covering technical approaches to ensuring that AI systems are safe.
Weekly Agendas
- Agenda for March 28, 2025: Mechanistic Interpretability
- Agenda for March 21, 2025: Robustness, Unlearning, and Control
- Agenda for March 14, 2025: Scalable Oversight
- Agenda for February 14, 2025: Alignment and RLHF
- Agenda for February 7, 2025: AI, AGI, AI Safety
About the Reading Group
This 8-week program explores crucial topics in AI alignment, including:
- Fundamentals of AI and AI safety
- Reinforcement Learning from Human Feedback (RLHF)
- Scalable oversight techniques
- Robustness and unlearning
- Mechanistic interpretability
- Technical governance
- AI control methods
Each session runs for two hours and consists of readings and discussion. No preparation is required outside the sessions.
Facilitators
The reading group is led by Lawrence Feng.
The reading list draws inspiration from the AI Safety Fundamentals course by BlueDot Impact.