Tables in RL for LLM#
| Algorithm | Token Advantage $\hat{A}_k\left(s_t^{(i j)}, a_t^{(i j)}\right)$ |
|---|---|
| REINFORCE | $r\left(x_i, y_{i j}\right)-\beta \sum_{t^{\prime}=0}^{T_{i j}-1} \log \frac{\pi_{\theta_k}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}{\pi_{\text {ref }}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}$ |
| REINFORCE with KL trick | $r\left(x_i, y_{i j}\right)-\beta \sum_{t^{\prime}=t}^{T_{i j}-1} \log \frac{\pi_{\theta_k}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}{\pi_{\text {ref }}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}$ |
| REINFORCE ++ | $r\left(x_i, y_{i j}\right)-\beta \sum_{t^{\prime}=t}^{T_{i j}-1} \log \frac{\pi_{\theta_k}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}{\pi_{\text {ref }}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}$ with global batch normalization |
| RLOO | $R_{i j}-\frac{1}{M-1} \sum_{l \neq j} R_{i l}$, where $R_{i j}:=r\left(x_i, y_{i j}\right)-\beta \sum_{t^{\prime}=0}^{T_{i j}-1} \log \frac{\pi_{\theta_k}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}{\pi_{\text {ref }}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}$ |
| REMAX | $r\left(x_i, y_{i j}\right)-r\left(x_i, \hat{y}_i\right)-\beta \sum_{t^{\prime}=t}^{T_{i j}-1} \log \frac{\pi_{\theta_k}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}{\pi_{\text {ref }}\left(a_{t^{\prime}}^{(i j)} \mid s_{t^{\prime}}^{(i j)}\right)}$, where $\hat{y}_i$ is generated by greedy decoding from $\pi_{\theta_k}$ |
| PPO | $\sum_{t^{\prime}=t}^{T_{i j}-1}(\lambda \gamma)^{t^{\prime}-t} \delta_{t^{\prime}}^{(i j)}$, where $\delta_t^{(i j)}=\mathbb{I}\left\{t=T_{i j}-1\right\} \cdot r\left(x_i, y_{i j}\right)-\beta \log \frac{\pi_{\theta_k}\left(a_t^{(i j)} \mid s_t^{(i j)}\right)}{\pi_{\text {ref }}\left(a_t^{(i j)} \mid s_t^{(i j)}\right)}+\gamma V_\phi\left(s_{t+1}^{(i j)}\right)-V_\phi\left(s_t^{(i j)}\right)$ |
| ORZ | $r\left(x_i, y_{i j}\right)-V_\phi\left(s_t^{(i j)}\right)$ (PPO with $\lambda=\gamma=1, \beta=0$ ) |
| GRPO (original version) | $\frac{1}{T_{i j}}\left[\frac{r\left(x_i, y_{i j}\right)-\operatorname{mean}\left(\mathbb{R}_i\right)}{\operatorname{std}\left(\mathbb{R}_i\right)}+\beta\left(\frac{\pi_{\text {ref }}\left(a_t^{(i j)} \mid s_t^{(i j)}\right)}{\pi_{\theta_k}\left(a_t^{(i j)} \mid s_t^{(i j)}\right)}-1\right)\right]$ |
| GRPO (R1 version) | $\frac{r\left(x_i, y_{i j}\right)-\operatorname{mean}\left(\mathbb{R}_i\right)}{\operatorname{std}\left(\mathbb{R}_i\right)}+\beta\left(\frac{\pi_{\text {ref }}\left(a_t^{(i j)} \mid s_t^{(i j)}\right)}{\pi_{\theta_k}\left(a_t^{(i j)} \mid s_t^{(i j)}\right)}-1\right)$ |
| DAPO | $\frac{1}{\hat{T}_i}\left[\frac{r\left(x_i, y_{i j}\right)-\operatorname{mean}\left(\mathbb{R}_i\right)}{\operatorname{std}\left(\mathbb{R}_i\right)}\right]$, where $\hat{T}_i=\frac{1}{M} \sum_{j=1}^M T_{i j}$. |
| Dr. GRPO | $r\left(x_i, y_{i j}\right)-\operatorname{mean}\left(\mathbb{R}_i\right)$ |
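
A minimal NumPy sketch (the helper name `group_advantages` is hypothetical; binary rewards, $\beta=0$) of how a few of the group-based advantages above are computed from the $M$ responses sampled for one prompt:

```python
import numpy as np

def group_advantages(rewards, algo="grpo_r1", eps=1e-6):
    """Scalar advantage per response for one prompt x_i (broadcast to all its tokens).

    rewards: shape (M,), binary rewards r(x_i, y_ij) of the M sampled responses.
    """
    r = np.asarray(rewards, dtype=float)
    M = r.shape[0]
    if algo == "rloo":
        # leave-one-out baseline: mean reward of the other M - 1 responses
        return r - (r.sum() - r) / (M - 1)
    if algo == "grpo_r1":
        # group mean/std normalization (R1 version, no length term)
        return (r - r.mean()) / (r.std() + eps)
    if algo == "dr_grpo":
        # Dr. GRPO: subtract the group mean only
        return r - r.mean()
    raise ValueError(algo)

# toy group: 4 responses, only the first one is correct
print(group_advantages([1, 0, 0, 0], algo="rloo"))     # [ 1.    -0.333 -0.333 -0.333]
print(group_advantages([1, 0, 0, 0], algo="dr_grpo"))  # [ 0.75  -0.25  -0.25  -0.25 ]
```

The second table below specializes the same advantages to binary rewards with $\beta = 0$ and compares their expectations.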
| Algorithm | Token Advantage $\hat{A}_k\left(s_t^{(i j)}, a_t^{(i j)}\right)$ (binary reward $r(x_i,y_{ij})\in \{0,1\}$, $\beta = 0$) | Expectation of Token Advantage $\mathbb{E}\left[\hat{A}_k\left(s_t^{(i j)}, a_t^{(i j)}\right) \mid\left(s_t^{(i j)}, a_t^{(i j)}\right)\right]$ | Expectation of the Gradient of Policy Loss $\mathbb{E}_{\mathbb{X}, \mathbb{Y}}\left[\nabla_\theta \hat{\mathcal{L}}\left(\theta_k\right)\right]$ |
|---|---|---|---|
| REINFORCE | $r\left(x_i, y_{i j}\right)$ | $Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)$ | $-\nabla_\theta \mathcal{J}\left(\theta_k\right)$ |
| REINFORCE ++ | $\frac{1}{\operatorname{std}(\mathbb{R})}\left[r\left(x_i, y_{i j}\right)-\operatorname{mean}(\mathbb{R})\right]$ | $\frac{1}{\sigma_k}\left(Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-\mu_k\right)$, where $\mu_k, \sigma_k$ are the mean and std of $r(x, y)$ over $x \sim \mathcal{D}, y \sim \pi_{\theta_k}(\cdot \mid x)$ | $-\frac{1}{\sigma_k} \nabla_\theta \mathcal{J}\left(\theta_k\right)$ asymptotically for large $NM$ |
| RLOO | $r\left(x_i, y_{i j}\right)-\frac{1}{M-1} \sum_{l \neq j} r\left(x_i, y_{i l}\right)$ | $Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-V^{\pi_{\theta_k}}\left(x_i\right)$ | $-\nabla_\theta \mathcal{J}\left(\theta_k\right)$ |
| REMAX | $r\left(x_i, y_{i j}\right)-r\left(x_i, \hat{y}_i\right)$, where $\hat{y}_i$ is generated by greedy decoding from $\pi_{\theta_k}$ (the greedy policy is denoted $\hat{\pi}_{\theta_k}$) | $Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-V^{\hat{\pi}_{\theta_k}}\left(x_i\right)$ | $-\nabla_\theta \mathcal{J}\left(\theta_k\right)$ |
| PPO | $(\lambda \gamma)^{T_{i j}-1-t} r\left(x_i, y_{i j}\right)+\sum_{t^{\prime}=t+1}^{T_{i j}-1}\left(\frac{1}{\lambda}-1\right)(\lambda \gamma)^{t^{\prime}-t} V_\phi\left(s_{t^{\prime}}^{(i j)}\right)-V_\phi\left(s_t^{(i j)}\right)$ | $A_{\lambda, \gamma}^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)$; if $V_\phi=V^{\pi_{\theta_k}}$ and $\gamma=1$, then $A_{\lambda, \gamma}^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)=A^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)$ | $-\nabla_\theta \mathcal{J}\left(\theta_k\right)$ if $V_\phi=V^{\pi_{\theta_k}}$ and $\gamma=1$ |
| ORZ | $r\left(x_i, y_{i j}\right)-V_\phi\left(s_t^{(i j)}\right)$ (PPO with $\lambda=\gamma=1$ ) | $Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-V_\phi\left(s_t^{(i j)}\right)$ | $-\nabla_\theta \mathcal{J}\left(\theta_k\right)$ |
| GRPO (original version) | $\frac{1}{T_{i j}}\left[\frac{r\left(x_i, y_{i j}\right)-\hat{p}_i}{\sqrt{\hat{p}_i\left(1-\hat{p}_i\right)}}\right]$ | Hard to analyze due to the length-normalization term $\frac{1}{T_{i j}}$ | $-\mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_{\theta_k}(\cdot \mid x)}\left[\frac{r(x, y) \nabla_\theta \log \pi_{\theta_k}(x, y)}{T \sqrt{p_{\theta_k}(x)\left(1-p_{\theta_k}(x)\right)}}\right] \neq -\nabla_\theta \mathcal{J}\left(\theta_k\right)$ asymptotically for large $M$ |
| GRPO (R1 version) | $\frac{r\left(x_i, y_{i j}\right)-\hat{p}_i}{\sqrt{\hat{p}_i\left(1-\hat{p}_i\right)}}$ | $\frac{Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-p_{\theta_k}\left(x_i\right)}{\sqrt{p_{\theta_k}\left(x_i\right)\left(1-p_{\theta_k}\left(x_i\right)\right)}}$ asymptotically for large $M$ | $-\mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_{\theta_k}(\cdot \mid x)}\left[\frac{r(x, y) \nabla_\theta \log \pi_{\theta_k}(x, y)}{\sqrt{p_{\theta_k}(x)\left(1-p_{\theta_k}(x)\right)}}\right] \neq -\nabla_\theta \mathcal{J}\left(\theta_k\right)$ asymptotically for large $M$ |
| DAPO | $\frac{1}{\hat{T}_i}\left[\frac{r\left(x_i, y_{i j}\right)-\hat{p}_i}{\sqrt{\hat{p}_i\left(1-\hat{p}_i\right)}}\right]$, where $\hat{T}_i=\frac{1}{M} \sum_{j=1}^M T_{i j}$ | $\frac{Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-p_{\theta_k}\left(x_i\right)}{T_{\theta_k}\left(x_i\right) \sqrt{p_{\theta_k}\left(x_i\right)\left(1-p_{\theta_k}\left(x_i\right)\right)}}$ asymptotically for large $M$ | $-\mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_{\theta_k}(\cdot \mid x)}\left[\frac{r(x, y) \nabla_\theta \log \pi_{\theta_k}(x, y)}{T_{\theta_k}(x) \cdot \sqrt{p_{\theta_k}(x)\left(1-p_{\theta_k}(x)\right)}}\right] \neq -\nabla_\theta \mathcal{J}\left(\theta_k\right)$ asymptotically for large $M$ |
| Dr. GRPO | $r\left(x_i, y_{i j}\right)-\hat{p}_i$ | $\frac{M-1}{M}\left(Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-V^{\pi_{\theta_k}}\left(x_i\right)\right)$ | $-\frac{M-1}{M} \nabla_\theta \mathcal{J}\left(\theta_k\right)$ |
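
As a sanity check, the Dr. GRPO row follows from a short calculation: with binary rewards, $\hat{p}_i=\frac{1}{M} \sum_{l=1}^M r\left(x_i, y_{i l}\right)$, and the $M-1$ responses other than $y_{i j}$ are sampled independently of $\left(s_t^{(i j)}, a_t^{(i j)}\right)$, so $$ \mathbb{E}\left[r\left(x_i, y_{i j}\right)-\hat{p}_i \mid s_t^{(i j)}, a_t^{(i j)}\right]=\frac{M-1}{M} \mathbb{E}\left[r\left(x_i, y_{i j}\right) \mid s_t^{(i j)}, a_t^{(i j)}\right]-\frac{1}{M} \sum_{l \neq j} \mathbb{E}\left[r\left(x_i, y_{i l}\right)\right]=\frac{M-1}{M}\left(Q^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)-V^{\pi_{\theta_k}}\left(x_i\right)\right). $$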
REINFORCE++ and Dr. GRPO implicitly adopt an adaptive learning rate: REINFORCE++ because of its global batch normalization, and Dr. GRPO because it does not use a leave-one-out baseline (hence the $\frac{M-1}{M}$ factor in its expected gradient).
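
A toy NumPy check (not from the original notes; the numbers are arbitrary) of the adaptive-learning-rate effect: dividing the centered batch rewards by the batch std rescales the entire policy-gradient update by $1/\operatorname{std}(\mathbb{R})$, which is equivalent to scaling the learning rate for that step.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=256).astype(float)   # binary rewards over the batch
grads = rng.normal(size=(256, 8))                      # toy per-sample grad-log-prob vectors

adv_centered = rewards - rewards.mean()
adv_normalized = adv_centered / rewards.std()          # REINFORCE++-style global batch normalization

update_centered = (adv_centered[:, None] * grads).mean(axis=0)
update_normalized = (adv_normalized[:, None] * grads).mean(axis=0)

# normalizing by the batch std rescales the whole update by 1/std,
# i.e. it behaves like an adaptive learning rate eta / sigma_k
print(np.allclose(update_normalized, update_centered / rewards.std()))  # True
```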
The expected gradients of GRPO (original/R1 version) and DAPO are not the true policy gradient. Moreover, their gradients asymptotically contain the factor $\frac{1}{\sqrt{p_{\theta_k}\left(x_i\right)\left(1-p_{\theta_k}\left(x_i\right)\right)}}$, so they favor learning on very easy and very hard prompts (which receive larger gradient weights). Because of the $\frac{1}{T_{i j}}$ length-normalization term in GRPO (original version), the policy may be biased toward short but correct answers; because of the $\frac{1}{T_{\theta_k}\left(x_i\right)}$ term in DAPO, learning is biased toward prompts whose average response length is shorter.
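
A quick numeric illustration (hypothetical success probabilities $p$) of that $\frac{1}{\sqrt{p_{\theta_k}(x)\left(1-p_{\theta_k}(x)\right)}}$ weight:

```python
import numpy as np

# gradient weight 1/sqrt(p(1-p)) as a function of the prompt's success
# probability p under the current policy (hypothetical values of p)
for p in [0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95]:
    print(f"p = {p:.2f}  ->  1/sqrt(p(1-p)) = {1.0 / np.sqrt(p * (1.0 - p)):.2f}")
# smallest at p = 0.5 (weight 2.0); grows as p -> 0 or p -> 1, so very hard
# and very easy prompts get the largest per-prompt gradient weights
```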
Assume $\lambda \gamma<1$. For early tokens ($t \rightarrow 0$ in a long response), $(\lambda \gamma)^{T_{i j}-1-t} \rightarrow 0$, so in PPO's Q-value estimate $$ \hat{Q}\left(s_t^{(i j)}, a_t^{(i j)}\right):=(\lambda \gamma)^{T_{i j}-1-t} r\left(x_i, y_{i j}\right)+\sum_{t^{\prime}=t+1}^{T_{i j}-1}\left(\frac{1}{\lambda}-1\right)(\lambda \gamma)^{t^{\prime}-t} V_\phi\left(s_{t^{\prime}}^{(i j)}\right) $$ the critic outputs $V_\phi\left(s_{t^{\prime}}^{(i j)}\right)$ dominate. If $V_\phi$ is inaccurate, which is exactly the hard part of PPO training, the estimate therefore carries a large bias. ORZ sets $\lambda=\gamma=1$, so the Q-value estimate no longer depends on the critic and does not incur this bias.
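
A minimal sketch (the helper name `gae_advantages` is hypothetical) of this advantage estimate with a terminal-only reward and per-token KL penalty; the ORZ special case $\lambda=\gamma=1$, $\beta=0$ reduces to $r\left(x_i, y_{i j}\right)-V_\phi\left(s_t^{(i j)}\right)$:

```python
import numpy as np

def gae_advantages(terminal_reward, values, kl_per_token, lam=0.95, gamma=1.0, beta=0.0):
    """GAE over a single response of length T (V(s_T) is taken to be 0).

    terminal_reward: scalar r(x_i, y_ij), granted only at the last token.
    values:          shape (T,), critic outputs V_phi(s_t).
    kl_per_token:    shape (T,), log(pi_theta / pi_ref) at each token.
    """
    T = len(values)
    rewards = -beta * np.asarray(kl_per_token, dtype=float)  # per-token KL penalty
    rewards[T - 1] += terminal_reward                        # outcome reward at the final token
    advantages = np.zeros(T)
    next_value, running = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # delta_t from the PPO row above
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# ORZ special case: lam = gamma = 1, beta = 0  =>  advantage_t = r - V_phi(s_t)
values = np.array([0.2, 0.5, 0.4])
print(gae_advantages(1.0, values, np.zeros(3), lam=1.0, gamma=1.0, beta=0.0))  # [0.8 0.5 0.6]
```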
The sign of $\hat{A}_k\left(s_t^{(i j)}, a_t^{(i j)}\right)$, which appears as the weight on the per-token gradient vector $\nabla_\theta \log \pi_{\theta_k}(x, y)$, roughly indicates whether the output probability of that token is pushed up or down. REINFORCE (which uses no baseline) tries to increase the probability of every sampled token regardless of whether the answer is correct, and therefore has large variance.
Except for PPO, all of the algorithms use $r(x_i,y_{ij})$ as a Monte Carlo estimate of $Q$: they generally increase token probabilities when the answer is correct and decrease them when it is wrong, and they do not differ fundamentally in how much variance they reduce. Using $r(x_i,y_{ij})$ as the $Q$ estimate does introduce training variance, because the trajectory's final reward $r(x_i,y_{ij})$ can disagree in sign with the token's true advantage $A^{\pi_{\theta_k}}\left(s_t^{(i j)}, a_t^{(i j)}\right)$: a correct answer may contain reasoning steps on which the model frequently errs, and conversely a trajectory with a wrong final answer may contain good reasoning steps.
Except for PPO and ORZ, every other algorithm's baseline for a state-action pair (e.g. $V^{\pi_{\theta_k}}\left(x_i\right)$) is at most prompt-level. From a variance-reduction standpoint, each token should use its own $V^{\pi_{\theta_k}}\left(s_t^{(ij)}\right)$ as the baseline. Under binary rewards, because ORZ gives up using the critic to estimate $Q$, its baseline $V_{\phi}\left(s_t^{(ij)}\right)$ effectively only rescales the advantage at each token (whose sign still matches the final trajectory reward). In this sense, only PPO achieves fine-grained token-level policy optimization.
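
To make the last point concrete, a toy contrast (hypothetical critic values, binary reward $r=1$): group-based methods broadcast one scalar to every token, ORZ rescales each token by its own $V_\phi\left(s_t^{(i j)}\right)$ but never flips the sign, while PPO's GAE (with $\lambda \gamma<1$) can assign a negative advantage to an individual token of a correct response.

```python
import numpy as np

r = 1.0                                   # the final answer is correct
values = np.array([0.2, 0.9, 0.4])        # V_phi(s_t): the critic is (over)confident at t = 1
lam, gamma = 0.5, 1.0                     # PPO with lam * gamma < 1

# group-based methods: one scalar (here Dr. GRPO's r - mean for a 1-of-4-correct group)
# broadcast to every token of the response
grpo_like = np.full(3, r - 0.25)

# ORZ: r - V_phi(s_t); per-token scale changes, but the sign matches the final reward
orz = r - values

# PPO / GAE: delta_t = gamma * V(s_{t+1}) - V(s_t) for t < T-1, delta_{T-1} = r - V(s_{T-1})
deltas = np.array([gamma * values[1] - values[0],
                   gamma * values[2] - values[1],
                   r - values[2]])
ppo = np.array([sum((lam * gamma) ** k * deltas[t + k] for k in range(3 - t))
                for t in range(3)])

print(grpo_like)  # [0.75 0.75 0.75]
print(orz)        # [0.8  0.1  0.6 ]
print(ppo)        # [0.6 -0.2  0.6 ]  <- token t = 1 gets a negative advantage
```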
Notes#
Mind Map#
