What this post covers:
- n-step TD, GAE, credit assignment and variance reduction in traditional RL
- How is RL of LLMs different from traditional RL?
- PPO, GRPO, DPO …
A story
Suppose you took a fifth grader and decided you wanted to make her an expert mathematician. The following method works surprisingly well in practice.
Give her many problems, and ask her to write an answer in the blank. A machine checks the answer in the blank, marking it as right or wrong, returning a single scalar score. We don’t grade any of the work written outside the blank. No partial credit. (Maybe we add a small extra penalty if she refuses to write anything in the blank at all.)
Now we’re going to surgically improve her by making tiny edits to her neurons—the little computational units inside the brain, roughly speaking. After she writes an answer, we ask a very specific question: if we could nudge each neuron a little bit, which nudge would make it more likely that she would produce exactly the answer she just wrote, given the same problem?
If her answer was right, we nudge each neuron in the direction that makes this behavior more likely to happen again. If her answer was wrong, we do the opposite, nudging each neuron in the direction that makes this behavior less likely.
The machine only tells us whether the final answer is right or wrong. It doesn’t tell us which part of her scratch work was insightful, which step was unnecessary, or where she first went off track. It doesn’t even tell us whether the wrong answer was “almost right,” or wrong for a subtle reason. It gives us one number, and we’re going to use only that to craft an expert mathematician.
What I’ve illustrated here isn’t that far off from how modern language models—such as DeepSeek-R1 or Qwen3—are improved when they’re trained to reason about math or code. In these stages, models generate a complete solution, receive a score only at the end (often from a verifier or reward model), and are updated using that outcome signal. Despite how blunt this feedback is, this approach is precisely what pushed these models to expert-level performance on math and coding benchmarks.
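To make the shape of that feedback concrete, here is a minimal sketch of an outcome-only reward function. The `extract_final_answer` helper, the `\boxed{...}` convention, and the exact scoring values are illustrative assumptions, not the scheme used by any particular model.

```python
import re

def extract_final_answer(solution_text: str) -> str | None:
    """Hypothetical helper: pull out whatever was written 'in the blank'.
    Here we assume the final answer is wrapped in \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", solution_text)
    return match.group(1).strip() if match else None

def outcome_reward(solution_text: str, reference_answer: str) -> float:
    """One scalar per full solution: the scratch work is never graded."""
    answer = extract_final_answer(solution_text)
    if answer is None:
        return -0.1   # small penalty for leaving the blank empty (an assumption)
    return 1.0 if answer == reference_answer else 0.0
```

Notice that the reasoning outside the blank never touches the score; only the final answer does.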
In this post, I’ll explain how we arrive at such an approach. We’ll revisit familiar reinforcement learning ideas—policy gradients, returns, baselines, and advantage estimation—but from the perspective of language models trained with outcome-only feedback. Rather than starting from the full RL formalism, I’ll use credit assignment as the organizing principle, and show how methods like PG, PPO, GRPO-style objectives, and DPO can be understood as different answers to the same underlying question: how do we make a single outcome-level judgment meaningfully shape a long chain of decisions?
The Basic Reinforcement Learning Setup
We take our policy $\pi_\theta(a \mid s)$, which outputs probabilities for an action $a$ given the current state $s$, parameterized by $\theta$, and we give it some starting state $s_0$. As the policy interacts with its environment, we take note of what it’s doing. Formally, we call this record a trajectory \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T).\)
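In code, a trajectory is just the list of $(s_t, a_t, r_t)$ triples recorded while the policy acts. The sketch below assumes a generic `env` with `reset()` and `step()` methods and a `policy` object with a `sample(state)` method; both are stand-ins rather than any specific library’s API.

```python
def collect_trajectory(env, policy, max_steps=1000):
    """Roll out one episode and record (s_t, a_t, r_t) at every step."""
    trajectory = []
    state = env.reset()                 # s_0
    for _ in range(max_steps):
        action = policy.sample(state)   # a_t ~ pi_theta(. | s_t)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory                   # [(s_0, a_0, r_0), ..., (s_T, a_T, r_T)]
```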
The natural first move is to score a trajectory by simply summing its rewards. But even this seemingly innocent choice hides several complications.
For one, not all trajectories are short. In some environments, interactions can continue for a very long time, or even indefinitely. If rewards keep arriving, then simply summing them can lead to objectives that are poorly behaved or not well-defined. We would like our notion of “how good a trajectory is” to remain finite and stable, even as trajectories get longer.
There is also a practical issue. When rewards far in the future are treated as equally important as rewards in the near term, small changes early in a trajectory can have large and unpredictable effects much later on. This makes learning unstable: the signal used to improve the policy becomes dominated by distant outcomes that are hard to predict and hard to attribute.
Finally, there is often an implicit preference for immediacy. Two trajectories might eventually reach the same outcome, but one does so quickly while the other wanders for a long time. In many settings, we would like the quicker success to count for more.
All of these considerations lead to the same modification. Instead of summing rewards directly, we introduce a discount factor \(\gamma \in (0,1]\), and define the return of a trajectory as
\[R(\tau) = \sum_{t=0}^T \gamma^t r_t.\]When \(\gamma = 1\), all rewards are treated equally, regardless of when they occur. As \(\gamma\) decreases, rewards further in the future contribute less to the total score. This simple change keeps returns finite even for arbitrarily long trajectories (provided \(\gamma < 1\) and rewards are bounded), reduces variance during learning, and encodes a soft preference for earlier rewards over later ones.
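As a quick sanity check, the return is a one-line fold over the reward list. A minimal sketch, with rewards as plain Python floats:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Earlier rewards count for more when gamma < 1:
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
print(discounted_return([1.0, 0.0, 0.0], gamma=0.9))  # 1.0
```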
With this definition in hand, we can now refine what we mean by “producing better trajectories.” We want the policy to maximize what is called the expected return
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big].\]You can read this as:
The expected return $J(\theta)$ is the average total reward we would obtain if we repeatedly generated trajectories $\tau$ by following the policy $\pi_\theta$ (together with the environment’s transition dynamics), summing up the rewards along each trajectory with a preference for earlier rewards when $\gamma < 1$.
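We never compute this expectation exactly. In practice it is approximated by averaging returns over sampled trajectories, as in the sketch below, which reuses the hypothetical `collect_trajectory` and `discounted_return` helpers from earlier:

```python
def estimate_expected_return(env, policy, num_trajectories=100, gamma=0.99):
    """Monte Carlo estimate of J(theta) = E[R(tau)] under the current policy."""
    returns = []
    for _ in range(num_trajectories):
        trajectory = collect_trajectory(env, policy)
        rewards = [r for (_, _, r) in trajectory]
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)
```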
At this point, \(R(\tau)\) is doing exactly one job: it compresses a sequence of rewards into a single scalar that lets us compare trajectories. It does not yet tell us how individual actions within a trajectory should be credited or blamed. That question only arises once we ask how to change \(\theta\) to improve \(J(\theta)\)—and answering it is where policy gradients, returns-to-go, and advantages enter the story.
We’ve scored our “math problem.” Now what?
So far, we’ve defined what it means for a trajectory to be good. We’ve taken a long sequence of interactions with the environment and compressed it into a single scalar score, the return $R(\tau)$, and we’ve said that our goal is to adjust $\theta$ to make the expected return
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]\]as large as possible.
But this definition hides the hard part. How do we actually change $\theta$ to improve $J(\theta)$?
As in most of modern machine learning, we will rely on first-order optimization. The basic idea is simple: compute the gradient of the objective with respect to the parameters, and then move the parameters a small step in the direction that increases the objective. In its simplest form, this is gradient ascent:
\[\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta),\]where $\alpha$ is a step size. In practice, we use more sophisticated optimizers, but the core signal is still the gradient $\nabla_\theta J(\theta)$.
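Here is a toy, self-contained illustration of the ascent update itself, maximizing a made-up objective $J(\theta) = -(\theta - 3)^2$ with PyTorch. It shows only the mechanics of the update; producing a usable gradient estimate of the real $J(\theta)$ is exactly what the rest of the post is about.

```python
import torch

# Toy objective: J(theta) = -(theta - 3)^2, maximized at theta = 3.
theta = torch.tensor(0.0, requires_grad=True)
alpha = 0.1  # step size

for _ in range(100):
    J = -(theta - 3.0) ** 2
    J.backward()                      # fills theta.grad with dJ/dtheta
    with torch.no_grad():
        theta += alpha * theta.grad   # ascent step: theta <- theta + alpha * grad
    theta.grad.zero_()

print(theta.item())  # approximately 3.0
```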
So the problem reduces to computing this gradient.
At this point, it is worth emphasizing what can and cannot depend on $\theta$. The rewards are produced by the environment, and the environment’s dynamics are not under our control. The only thing we get to change is the policy itself: the probabilities with which it chooses actions in different states.
This observation is crucial. When we take the gradient of $J(\theta)$, we are not asking how to change rewards directly. We are asking a different question:
How should the parameters of the policy change so that the trajectories it tends to generate receive higher returns?
Answering this question will force us to confront credit assignment head-on. The objective $J(\theta)$ depends on entire trajectories, but the parameters $\theta$ influence those trajectories only through a long chain of action choices. Turning a single trajectory-level score into meaningful parameter updates is the central technical challenge—and everything that follows is an attempt to make that process precise and stable.
What does “policy gradient” even mean?
Oh—that term \(\nabla_\theta J(\theta)\) we just wrote down? That’s what people mean by policy gradient. If you literally swap the two words around, you get “the gradient of the policy.”
So when we talk about a policy gradient, we literally mean:
the gradient of the expected return with respect to the parameters of the policy itself.
This already distinguishes policy gradient methods from other families of reinforcement learning algorithms. In value-based methods, such as Q-learning, we learn a value function and then derive a policy indirectly from it. Here, there is no intermediate object. The policy is the thing being optimized.
But there is an immediate technical obstacle. The return $R(\tau)$ is not a differentiable function of $\theta$. It is produced by the environment, possibly by a verifier, a reward model, or a hard-coded scoring rule. We cannot take derivatives of rewards with respect to policy parameters.
The only place $\theta$ enters the picture is through the distribution over trajectories. Different policies induce different probabilities over which trajectories are likely to occur. This suggests a shift in perspective: instead of differentiating the reward, we differentiate the likelihood of the behavior that produced it.
Formally, we start by writing out the gradient we want to compute:
\[\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right].\]Writing the expectation as an integral over trajectories,
\[= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau.\]At this point, the reward is just a weight. All of the dependence on $\theta$ is inside $p_\theta(\tau)$, the probability that the policy assigns to trajectory $\tau$.
We now apply a simple but powerful identity:
\[\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau).\]This is sometimes called the log-derivative trick. It lets us move the gradient from the probability itself to its logarithm, which turns out to be much easier to work with.
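You can check this identity numerically on any simple parametric distribution. Here is a small sketch using a softmax over three actions; the particular setup is arbitrary, chosen only for illustration.

```python
import torch

theta = torch.randn(3, requires_grad=True)
action = 1  # an arbitrary fixed outcome

# Left-hand side: gradient of the probability itself.
p = torch.softmax(theta, dim=0)[action]
lhs, = torch.autograd.grad(p, theta)

# Right-hand side: probability times gradient of the log-probability.
log_p = torch.log_softmax(theta, dim=0)[action]
grad_log_p, = torch.autograd.grad(log_p, theta)
rhs = p.detach() * grad_log_p

print(torch.allclose(lhs, rhs))  # True (up to floating point error)
```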
Using this identity, we get
\[\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right].\]You can read this as:
To increase expected return, we increase the log-probability of trajectories that achieved high return, and decrease the log-probability of trajectories that achieved low return.
This expression is the core of policy gradient methods. It says nothing yet about time steps, actions, or credit assignment within a trajectory. It simply states that learning proceeds by reshaping the policy so that it becomes more likely to generate trajectories that score well under the objective.
To make this more concrete, we expand the trajectory probability. A trajectory is generated by repeatedly sampling actions from the policy and states from the environment:
\[p_\theta(\tau) = p(s_0)\prod_{t=0}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).\]The environment dynamics do not depend on $\theta$, so when we take gradients they disappear:
\[\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t).\]Substituting this back into the gradient expression yields
\[\nabla_\theta J(\theta) = \mathbb{E}\!\left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; R(\tau) \right].\]This is the most basic policy gradient estimator. It makes the name fully explicit: we are summing gradients of the policy’s log-probabilities, weighted by a scalar measure of how good the resulting behavior was.
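In an autodiff framework, we implement this estimator by constructing a surrogate loss whose gradient is exactly this expression: the negated sum of log-probabilities of the actions actually taken, weighted by the trajectory’s return. A minimal PyTorch sketch for a single sampled trajectory follows; the tiny linear policy and the placeholder state, action, and return tensors are assumptions standing in for real environment interaction.

```python
import torch
import torch.nn as nn

# A toy policy: 4-dimensional states, 2 discrete actions.
policy_net = nn.Linear(4, 2)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)

# One sampled trajectory (placeholder data standing in for env interaction).
states = torch.randn(5, 4)                 # s_0 ... s_4
actions = torch.randint(0, 2, (5,))        # a_0 ... a_4 actually taken
trajectory_return = torch.tensor(1.0)      # R(tau), one scalar for the whole rollout

# Surrogate loss: -(sum_t log pi_theta(a_t | s_t)) * R(tau).
log_probs = torch.log_softmax(policy_net(states), dim=-1)
chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
loss = -(chosen_log_probs.sum() * trajectory_return)

optimizer.zero_grad()
loss.backward()   # gradient matches the estimator: sum_t grad log pi * R(tau)
optimizer.step()
```

Note that the return enters only as a constant weight: gradients flow through the log-probabilities alone, exactly as in the derivation above.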
At the same time, this equation exposes a serious problem. Every action in the trajectory is multiplied by the same return $R(\tau)$: early decisions get full credit for rewards that arrive much later, and later decisions get credit even for rewards that occurred before they were taken. The estimator is unbiased, but its variance can be enormous.
This is the point where credit assignment truly enters the story. The next step is to ask how we can modify this expression so that each action is credited only for the outcomes it could plausibly have influenced.