Alignment and RLHF
1. Preliminaries
- Let’s do a quick round of introductions
- Please fill out the introductory form
- Reminder on discussion guidelines:
- No need to close your laptop as soon as you finish the reading
- However, please do close your laptop once the discussion starts
2. Core Content and Discussions
- This more technical article explains the motivation for a system like RLHF and gives concrete details on how the approach is applied to neural networks (for the core reward-modelling step, see the first sketch after this list).
- Additional readings:
- This paper explains Anthropic’s constitutional AI approach, which largely extends RLHF by replacing human demonstrators and evaluators with AI feedback (see the second sketch after this list).
- Focus on sections 1.2, 3.1, 3.4, 4.1, 6.1, 6.2.
- This paper introduces OpenAI’s approach to aligning its o-series reasoning models.
- This paper compiles a number of open problems in improving RLHF techniques.
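
To ground the first reading, here is a minimal sketch of the preference-based reward modelling at the heart of RLHF: a scalar reward head is trained so that responses humans preferred score higher than the ones they rejected. It assumes PyTorch, and the `RewardModel` class and the random toy "embeddings" are illustrative stand-ins for a real language model’s representations, not any paper’s exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward head: maps a response representation to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push reward(chosen) above reward(rejected)."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy training loop: random vectors stand in for embeddings of
# (preferred, rejected) response pairs from human comparison data.
dim, batch = 16, 8
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen, rejected = torch.randn(batch, dim), torch.randn(batch, dim)
    loss = preference_loss(model, chosen, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The trained reward model then serves as the objective for a reinforcement-learning step (typically PPO) that fine-tunes the policy model.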
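
The constitutional AI reading swaps the human comparisons above for AI feedback. The sketch below shows that substitution under the same toy assumptions: an evaluator model, prompted with a principle from the constitution, picks the better of two candidate responses, and the resulting pairs feed the same preference loss. `ask_model` is a hypothetical stub standing in for a real language-model call.

```python
# One illustrative principle; Anthropic's actual constitution has many.
CONSTITUTION = [
    "Choose the response that is more helpful, honest, and harmless.",
]

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for a call to an evaluator language model;
    # stubbed so the sketch runs on its own.
    return "A"

def ai_preference(question: str, response_a: str, response_b: str):
    """Have the AI evaluator label a (chosen, rejected) preference pair."""
    prompt = (
        f"{CONSTITUTION[0]}\n\n"
        f"Question: {question}\n(A) {response_a}\n(B) {response_b}\n"
        "Answer with A or B."
    )
    if ask_model(prompt) == "A":
        return response_a, response_b  # (chosen, rejected)
    return response_b, response_a
```

These AI-labelled pairs then train a reward model exactly as in the first sketch, which is why the paper reads as an extension of RLHF.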