Alignment and RLHF

1. Preliminaries

  • Let’s do a quick round of introductions
  • Please fill out the introductory form
  • Reminder on discussion guidelines:
    • No need to close your laptop once you’re done reading
    • However, please do close your laptop once the discussion starts

2. Core Content and Discussions

2.1 Review and Introduction to the Alignment Problem (20 minutes)

2.2 Introduction to RLHF (50 minutes)

2.3 Constitutional AI by Anthropic (30 minutes)

  • This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF in which AI-generated critiques and preference labels replace human demonstrators and human evaluators.
  • Focus on sections 1.2, 3.1, 3.4, 4.1, 6.1, 6.2.
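  • As a rough illustration of the paper’s supervised phase, the critique-and-revision loop can be sketched as below. This is a simplified sketch, not Anthropic’s implementation; call_model is a hypothetical stand-in for a real language-model API, and the constitution here is an invented two-principle example.

```python
# Sketch of Constitutional AI's supervised (critique -> revision) phase.
# call_model() is a hypothetical placeholder for a real LLM API call.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest.",
]

def call_model(prompt: str) -> str:
    # Placeholder: a real implementation would query a language model here.
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(user_prompt: str, num_rounds: int = 2) -> str:
    """Generate a response, then repeatedly critique and revise it
    against principles from the constitution."""
    response = call_model(user_prompt)
    for i in range(num_rounds):
        principle = CONSTITUTION[i % len(CONSTITUTION)]
        critique = call_model(
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        response = call_model(
            f"Revise the response to address this critique:\n{critique}"
        )
    # In the full method, revised responses become supervised
    # fine-tuning data; a later RL phase uses AI preference labels.
    return response
```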

2.4 Deliberative Alignment by OpenAI (30 minutes, optional)

  • This paper introduces OpenAI’s approach to aligning its o-series reasoning models.

2.5 Open Problems in RLHF (20 minutes)

  • This paper surveys open problems in RLHF and directions for improving it.