AI Alignment Reading Group
Welcome to the Carnegie AI Safety Initiative (CASI) AI Alignment Reading Group for Spring 2025. This reading group focuses on topics in AI alignment, covering technical approaches to ensuring that AI systems are safe.
Weekly Agendas
- Agenda for March 28, 2025: Mechanistic Interpretability
- Agenda for March 21, 2025: Robustness, Unlearning, and Control
- Agenda for March 14, 2025: Scalable Oversight
- Agenda for February 14, 2025: Alignment and RLHF
- Agenda for February 7, 2025: AI, AGI, AI Safety
About the Reading Group
This 8-week program explores crucial topics in AI alignment, including:
- Fundamentals of AI and AI safety
- Reinforcement Learning from Human Feedback (RLHF)
- Scalable oversight techniques
- Robustness and unlearning
- Mechanistic interpretability
- Technical governance
- AI control methods
Each session runs for two hours and consists of readings and discussion. No preparation is required outside the sessions.
Facilitators
The reading group is led by Lawrence Feng.
The reading list draws inspiration from the AI Safety Fundamentals course by BlueDot Impact.