Alignment and RLHF
1. Preliminaries
- Let’s do a quick round of introductions
- Please fill out the introductory form
- Reminder on discussion guidelines:
- No need to close your laptop as soon as you finish the reading
- However, please do close your laptop once the discussion starts
2. Core Content and Discussions
- This more technical article explains the motivation for a system like RLHF and gives concrete details on how the approach is applied to neural networks (for the core reward-modelling step, see the first sketch after this list).
- Additional readings:
- This paper explains Anthropic’s constitutional AI approach, which largely extends RLHF by replacing human demonstrators and evaluators with AI feedback (see the second sketch after this list).
- Focus on sections 1.2, 3.1, 3.4, 4.1, 6.1, 6.2.
- This paper introduces OpenAI’s approach to aligning its o-series reasoning models.
- This paper compiles a number of open problems in improving RLHF techniques.
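
To ground the first reading, here is a minimal sketch of the preference-based reward modelling at the heart of RLHF: a scalar reward head is trained so that responses humans preferred score higher than the ones they rejected. It assumes PyTorch, and the `RewardModel` class and the random toy "embeddings" are illustrative stand-ins for a real language model’s representations, not any paper’s exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward head: maps a response representation to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push reward(chosen) above reward(rejected)."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy training loop: random vectors stand in for embeddings of
# (preferred, rejected) response pairs from human comparison data.
dim, batch = 16, 8
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen, rejected = torch.randn(batch, dim), torch.randn(batch, dim)
    loss = preference_loss(model, chosen, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The trained reward model then serves as the objective for a reinforcement-learning step (typically PPO) that fine-tunes the policy model.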
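
The constitutional AI reading swaps the human comparisons above for AI feedback. The sketch below shows that substitution under the same toy assumptions: an evaluator model, prompted with a principle from the constitution, picks the better of two candidate responses, and the resulting pairs feed the same preference loss. `ask_model` is a hypothetical stub standing in for a real language-model call.

```python
# One illustrative principle; Anthropic's actual constitution has many.
CONSTITUTION = [
    "Choose the response that is more helpful, honest, and harmless.",
]

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for a call to an evaluator language model;
    # stubbed so the sketch runs on its own.
    return "A"

def ai_preference(question: str, response_a: str, response_b: str):
    """Have the AI evaluator label a (chosen, rejected) preference pair."""
    prompt = (
        f"{CONSTITUTION[0]}\n\n"
        f"Question: {question}\n(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    if ask_model(prompt) == "A":
        return response_a, response_b  # (chosen, rejected)
    return response_b, response_a
```

These AI-labelled pairs then train a reward model exactly as in the first sketch, which is why the paper reads as an extension of RLHF.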