Policy Gradient Methods

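DQN, which is based on Q-learning, has the following problems:

- It only suits discrete action spaces: the update takes a \(\max\) over \(\mathcal{A}\), which cannot be computed by enumeration when the action space is infinite.
- Function approximation biases the estimated Q-values, so the resulting policy can be unstable or even diverge.
- Optimizing Q-values is not the same as optimizing the final return: better Q-value estimates do not guarantee a better return.
- Q-learning yields only a deterministic policy, not a distribution over actions.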

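In contrast, a policy model learns the action distribution \(\pi(a\mid s; \theta)\) directly. Its optimization objective is to maximize the expected discounted return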

\[ \begin{aligned} J(\theta) &= \mathbb{E}_{\pi_\theta} \sum_{t=0}^\infty \gamma^t r_t \\ &= \sum_{t = 0}^\infty \gamma^t \sum_{s_t} P(s_t) \sum_{a_t} \pi(a_t\mid s_t; \theta) R(s_t, a_t) \\ &= \sum_{t = 0}^\infty \gamma^t \sum_{s_t} \sum_{s_0} P(s_0) P(s_t\mid s_0, \pi) \sum_{a_t} \pi(a_t\mid s_t; \theta) R(s_t, a_t) \\ &= \sum_{t = 0}^\infty \sum_{s}\sum_{s_0} \gamma^t P(s_0) P(s_t = s\mid s_0, \pi) \sum_{a} \pi(a\mid s; \theta) R(s, a) \\ &= \sum_{s}\sum_{s_0} \sum_{t = 0}^\infty \gamma^t P(s_0) P(s_t = s\mid s_0, \pi) \sum_{a} \pi(a\mid s; \theta) R(s, a) \\ \end{aligned} \]

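Define \(\rho_\gamma^\pi (s) = \sum_{s_0} \sum_{t = 0}^\infty \gamma^t P(s_0) P(s_t = s\mid s_0, \pi)\) as the \(\gamma\)-discounted state distribution under policy \(\pi\). It is unnormalized: summed over \(s\) it equals \(1/(1-\gamma)\) rather than 1. With this definition, the objective becomes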

\[ J(\theta) = \sum_{s} \rho_\gamma^\pi (s) \sum_{a} \pi(a\mid s; \theta) R(s, a) \]

That is,

\[ J(\theta) = \mathbb{E}_{s\sim \rho_\gamma^\pi, a\sim \pi(a\mid s; \theta)} \left[ R(s, a) \right] \]
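The rewriting of \(J(\theta)\) in terms of \(\rho_\gamma^\pi\) can be checked numerically. The sketch below is only an illustration under assumed values: the two-state MDP (`P`, `R`, `PI`, `P0`) is made up, and the helper names are hypothetical. It computes \(\rho_\gamma^\pi\) by stepping the state distribution forward, then compares \(\sum_s \rho_\gamma^\pi(s) \sum_a \pi(a\mid s) R(s, a)\) with the expected start-state value obtained by iterative policy evaluation.

```python
GAMMA = 0.9
P = [  # P[s][a][s2]: transition probabilities (made-up numbers)
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [[1.0, 0.0], [0.0, 2.0]]   # R[s][a]: expected immediate reward
PI = [[0.6, 0.4], [0.3, 0.7]]  # PI[s][a] = pi(a | s)
P0 = [1.0, 0.0]                # initial state distribution


def discounted_state_distribution(p0, horizon=2000):
    """rho(s) = sum_t gamma^t P(s_t = s), truncated after `horizon` steps."""
    n = len(p0)
    rho = [0.0] * n
    d = list(p0)  # d[s] = P(s_t = s) at the current step t
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            rho[s] += discount * d[s]
        # One step of the Markov chain over states induced by the policy.
        d = [
            sum(d[s] * PI[s][a] * P[s][a][s2]
                for s in range(n) for a in range(len(PI[s])))
            for s2 in range(n)
        ]
        discount *= GAMMA
    return rho


def expected_reward(s):
    """sum_a pi(a | s) R(s, a)."""
    return sum(PI[s][a] * R[s][a] for a in range(len(PI[s])))


def state_values(iters=2000):
    """Iterative policy evaluation of V^pi."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [
            sum(PI[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(n)))
                for a in range(len(PI[s])))
            for s in range(n)
        ]
    return v


rho = discounted_state_distribution(P0)
j_from_rho = sum(rho[s] * expected_reward(s) for s in range(len(rho)))
j_from_values = sum(P0[s] * v for s, v in enumerate(state_values()))
```

Both routes give the same \(J\) up to floating-point error, and `sum(rho)` comes out as \(1/(1-\gamma)\), which is why \(\rho_\gamma^\pi\) is a distribution only after rescaling by \(1-\gamma\).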

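To compute the gradient of \(J\) with respect to \(\theta\), we make the approximation \(\nabla_\theta\rho_\gamma^\pi (s) = 0\). Note that \(\nabla_\theta \pi(a\mid s; \theta) = \pi(a\mid s; \theta) \nabla_\theta \log \pi(a\mid s; \theta)\). Because changing the action \(a\) changes the entire subsequent trajectory, we replace \(R(s, a)\) with \(Q^{\pi_\theta}(s, a)\), the long-term return of following the current policy \(\pi_\theta\) after taking action \(a\). Similarly, we ignore the gradient of \(Q\) with respect to \(\theta\). The argument is heuristic, but the expression it arrives at is the one that the policy gradient theorem establishes exactly.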

\[ \begin{aligned} \nabla_\theta J(\theta) &= \sum_{s} \rho_\gamma^\pi (s) \sum_{a} \nabla_\theta \pi(a\mid s; \theta) Q^{\pi_\theta}(s, a) \\ &= \sum_{s} \rho_\gamma^\pi (s) \sum_{a} \pi(a\mid s; \theta) Q^{\pi_\theta}(s, a) \nabla_\theta \log \pi(a\mid s; \theta) \\ &= \mathbb E_{s\sim \rho_\gamma^\pi, a\sim \pi(a\mid s; \theta)} \left[ Q^{\pi_\theta}(s, a) \nabla_\theta \log \pi(a\mid s; \theta) \right] \\ \end{aligned} \]

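This result is called the policy gradient theorem, and \(\nabla_\theta \log \pi(a\mid s; \theta)\) is called the score function. Intuitively, the theorem says that the policy should move in the parameter direction that increases future return fastest, raising the probability of high-value actions.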
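As a sanity check on the score-function form, the sketch below evaluates it on the simplest possible case and compares it with a finite-difference gradient. The setting is an assumption chosen for illustration: a single-state problem (a bandit) with a softmax policy over three actions, where \(Q(a)\) is just the reward of action \(a\). The reward values and function names are hypothetical. For a softmax policy the score function is \(\nabla_\theta \log \pi(a) = \text{onehot}(a) - \pi\).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(theta, a):
    """grad_theta log pi(a; theta) for a softmax policy: one_hot(a) - pi."""
    pi = softmax(theta)
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(theta))]


def objective(theta, rewards):
    """J(theta) = sum_a pi(a; theta) R(a) for a single-state problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def policy_gradient(theta, rewards):
    """E_{a ~ pi}[Q(a) * grad log pi(a)], summed exactly over all actions."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, q) in enumerate(zip(pi, rewards)):
        g = score(theta, a)
        for i in range(len(theta)):
            grad[i] += p * q * g[i]
    return grad


def finite_difference(theta, rewards, eps=1e-6):
    """Central-difference approximation of grad_theta J(theta)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grad.append((objective(up, rewards) - objective(down, rewards)) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]   # made-up logits
rewards = [1.0, 3.0, 0.5]  # made-up Q-values
pg = policy_gradient(theta, rewards)
fd = finite_difference(theta, rewards)
```

The two gradients agree to numerical precision. In practice the expectation is not summed exactly but estimated from sampled actions, which is where the variance discussed for REINFORCE comes from.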

REINFORCE

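REINFORCE estimates the Q-value with the Monte Carlo method. Given a trajectory \(\{s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T\}\), REINFORCE uses \(G_t \triangleq \sum_{k = 0}^{T - t - 1} \gamma^k r_{t + k}\) as an estimate of \(Q^{\pi_\theta}(s_t, a_t)\), then updates the parameters \(\theta\) according to the policy gradient theorem.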
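A minimal sketch of these two steps, assuming a tabular softmax policy whose logits are stored as `theta[s][a]`. That layout, the learning rate `lr`, and the function names are illustrative choices, not something fixed by the method. Returns are accumulated backwards so that each \(G_t\) costs constant time.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """G_t = sum_{k=0}^{T-t-1} gamma^k r_{t+k}, computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_update(theta, states, actions, rewards, gamma, lr):
    """One REINFORCE step: theta += lr * sum_t G_t * grad log pi(a_t | s_t)."""
    returns = discounted_returns(rewards, gamma)
    # Accumulate the gradient with theta held fixed, then apply it once.
    grad = [[0.0] * len(row) for row in theta]
    for s, a, g in zip(states, actions, returns):
        pi = softmax(theta[s])
        for i in range(len(pi)):
            grad[s][i] += g * ((1.0 if i == a else 0.0) - pi[i])
    for s in range(len(theta)):
        for i in range(len(theta[s])):
            theta[s][i] += lr * grad[s][i]
    return theta


returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# G_2 = 2.0, G_1 = 0.0 + 0.5 * 2.0 = 1.0, G_0 = 1.0 + 0.5 * 1.0 = 1.5

theta = reinforce_update([[0.0, 0.0]], states=[0], actions=[1],
                         rewards=[1.0], gamma=0.9, lr=0.1)
```

In the one-step example the sampled action 1 earned a positive return, so its logit rises and the other falls by the same amount.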

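Because REINFORCE relies entirely on Monte Carlo estimates of \(Q\), it suffers from high variance. To reduce the variance, REINFORCE subtracts a baseline. The baseline is a function \(b(s)\) that does not depend on the action \(a\), so subtracting it leaves the expected gradient unchanged. The state-value function \(V(s)\) is the usual choice. The sampled \(Q\)-value minus the baseline \(b(s)\) is then used as the weight in the gradient update.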

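\[ \begin{aligned} y(s, a) &= Q^{\pi_\theta}(s, a) - b_\phi(s) \\ \nabla_\theta J(\theta) &= \mathbb E_{s\sim \rho_\gamma^\pi, a\sim \pi(a\mid s; \theta)} \left[ y(s, a) \nabla_\theta \log \pi(a\mid s; \theta) \right] \\ \mathcal L(\phi) &= \mathbb E_{s\sim \rho_\gamma^\pi, a\sim \pi(a\mid s; \theta)} \left[ \left( Q^{\pi_\theta}(s, a) - b_\phi(s) \right)^2 \right] \\ \nabla_\phi \mathcal L(\phi) &= -2\, \mathbb E_{s\sim \rho_\gamma^\pi, a\sim \pi(a\mid s; \theta)} \left[ \left( Q^{\pi_\theta}(s, a) - b_\phi(s) \right) \nabla_\phi b_\phi(s) \right] \\ \end{aligned} \]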
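Why a baseline helps can be seen on a small example. The sketch below assumes a single state with two actions, a uniform softmax policy, and made-up Q-values; all names are hypothetical. It computes the exact mean and total variance of the single-sample estimator \((Q(a) - b)\,\nabla_\theta \log \pi(a)\) with \(b = 0\) and with \(b = V = \sum_a \pi(a) Q(a)\).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(theta, q_values, baseline):
    """Exact mean and total variance of (Q(a) - b) * grad log pi(a), a ~ pi."""
    pi = softmax(theta)
    n = len(theta)
    samples = []
    for a in range(n):
        score = [(1.0 if i == a else 0.0) - pi[i] for i in range(n)]
        samples.append([(q_values[a] - baseline) * g for g in score])
    mean = [sum(pi[a] * samples[a][i] for a in range(n)) for i in range(n)]
    # Sum of per-coordinate variances of the one-sample gradient estimate.
    var = sum(pi[a] * (samples[a][i] - mean[i]) ** 2
              for a in range(n) for i in range(n))
    return mean, var


theta = [0.0, 0.0]        # uniform policy over two actions
q_values = [10.0, 11.0]   # made-up Q-values with a large common offset
v = sum(p * q for p, q in zip(softmax(theta), q_values))  # V(s) as baseline

mean_plain, var_plain = estimator_stats(theta, q_values, baseline=0.0)
mean_base, var_base = estimator_stats(theta, q_values, baseline=v)
```

The two means coincide because \(\sum_a \pi(a) \nabla_\theta \log \pi(a) = 0\), so any action-independent \(b\) drops out in expectation. The variance shrinks here because the baseline removes the large offset shared by both Q-values.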

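In practice, the value network is usually given a higher learning rate so that it converges more quickly to the true state-value function. An entropy term is also commonly added to the policy objective to encourage exploration.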

\[ \mathbb{H} (\pi_\theta) = \mathbb{E}_{s, a} \left[ -\log \pi_\theta(a\mid s) \right] \]
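For a discrete action space the inner expectation over actions is a finite sum, \(-\sum_a \pi_\theta(a\mid s) \log \pi_\theta(a\mid s)\). The sketch below computes it for one state. The function name and the example distributions are illustrative assumptions.

```python
import math


def entropy(probs):
    """H(pi(. | s)) = -sum_a pi(a | s) log pi(a | s); zero-probability actions contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])   # maximal for four actions: log 4
peaked = entropy([0.97, 0.01, 0.01, 0.01])    # nearly deterministic policy
deterministic = entropy([1.0, 0.0, 0.0, 0.0])
```

One common way to use this, stated here as an assumption rather than a fixed recipe, is to maximize \(J(\theta) + \beta\,\mathbb{H}(\pi_\theta)\) for a small coefficient \(\beta > 0\), which penalizes policies that collapse onto a single action too early.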

Advantage Actor-Critic

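Advantage Actor-Critic (A2C) is a variant of REINFORCE. It introduces a state-value function \(V_\phi(s)\) and uses the one-step temporal-difference target \(r_t + \gamma V_\phi(s_{t + 1})\) as the estimate of \(Q^{\pi_\theta}(s_t, a_t)\). The same value function also serves as the baseline \(b(s)\). The policy model is updated with the policy gradient and the value model with an MSE loss, giving the following update rules: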

\[ \begin{aligned} \hat Q(s_t, a_t) &= r_t + \gamma\,\text{stop\_gradient}(V_\phi(s_{t + 1})) \\ \theta &\leftarrow \theta + \eta_{\theta} (\hat Q(s_t, a_t) - V_\phi(s_t)) \nabla_\theta \log \pi_\theta(a_t\mid s_t) \\ \phi &\leftarrow \phi - \eta_{\phi} \nabla_\phi (\hat Q(s_t, a_t) - V_\phi(s_t))^2 \end{aligned} \]
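The update rules can be sketched for a tabular actor and critic, where the gradients have closed forms. Everything specific here is an assumption for illustration: logits stored as `theta[s][a]`, values stored as `v[s]`, the `done` flag that zeroes the bootstrap term at episode end, and the step sizes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_step(theta, v, transition, gamma, lr_actor, lr_critic):
    """Apply one A2C update in place for a single transition; return the advantage."""
    s, a, r, s_next, done = transition
    # TD target; v[s_next] is read as a constant, which plays the role of stop_gradient.
    target = r + (0.0 if done else gamma * v[s_next])
    advantage = target - v[s]

    # Actor: theta += lr * advantage * grad log pi(a | s), with softmax score one_hot(a) - pi.
    pi = softmax(theta[s])
    for i in range(len(pi)):
        theta[s][i] += lr_actor * advantage * ((1.0 if i == a else 0.0) - pi[i])

    # Critic: the gradient of (target - v[s])^2 with respect to v[s] is -2 * advantage.
    v[s] -= lr_critic * (-2.0 * advantage)
    return advantage


theta = [[0.0, 0.0], [0.0, 0.0]]  # 2 states, 2 actions
v = [0.0, 0.0]
adv = a2c_step(theta, v, transition=(0, 1, 1.0, 1, False),
               gamma=0.9, lr_actor=0.1, lr_critic=0.05)
```

With all values initialized to zero, the transition yields an advantage of 1, so the logit of the taken action rises and `v[0]` moves toward the target. With neural networks the same two updates would come from automatic differentiation of the corresponding losses.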

Generalized Advantage Estimation

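Generalized Advantage Estimation (GAE) combines A2C's one-step temporal difference with REINFORCE's Monte Carlo estimate, using an approach similar to \(\text{TD}(\lambda)\) to estimate the advantage.

Expand the advantage over more steps: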

\[ \begin{aligned} A_t^{(1)} &= r_t + \gamma V_\phi(s_{t + 1}) - V_\phi(s_t) \\ A_t^{(2)} &= r_t + \gamma (r_{t + 1} + \gamma V_\phi(s_{t + 2})) - V_\phi(s_t) \\ &\;\;\vdots \\ A_t^{(k)} &= r_t + \gamma (r_{t + 1} + \gamma (r_{t + 2} + \ldots + \gamma V_\phi(s_{t + k}))) - V_\phi(s_t) \\ &= r_t + \gamma r_{t + 1} + \ldots + \gamma^{k - 1} r_{t + k - 1} + \gamma^k V_\phi(s_{t + k}) - V_\phi(s_t) \\ \end{aligned} \]
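The \(k\)-step advantage can be computed directly from a recorded trajectory. The sketch below assumes `values` holds \(V_\phi(s_0), \ldots, V_\phi(s_T)\), one more entry than `rewards`, and uses made-up numbers. It also checks the telescoping identity \(A_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}\), where \(\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\) is the one-step TD error.

```python
def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k) = sum_{l<k} gamma^l r_{t+l} + gamma^k V(s_{t+k}) - V(s_t)."""
    partial_return = sum(gamma ** l * rewards[t + l] for l in range(k))
    return partial_return + gamma ** k * values[t + k] - values[t]


def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma V(s_{t+1}) - V(s_t) for every step of the trajectory."""
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # V(s_0..s_3); the last state is treated as terminal
gamma = 0.9

deltas = td_errors(rewards, values, gamma)
a1 = k_step_advantage(rewards, values, t=0, k=1, gamma=gamma)
a3 = k_step_advantage(rewards, values, t=0, k=3, gamma=gamma)
a3_from_deltas = sum(gamma ** l * deltas[l] for l in range(3))
```

The intermediate value terms cancel in pairs, which is why the sum of discounted TD errors reproduces the \(k\)-step advantage.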

Weight the \(A_t^{(k)}\) with exponentially decaying weights:

\[ \begin{aligned} \hat A_t &= \frac{\sum_{k = 1}^T \lambda^{k - 1} A_t^{(k)}}{\sum_{k = 1}^T \lambda^{k - 1}} \\ &= \frac{\sum_{k = 1}^T \lambda^{k - 1} \left( r_t + \gamma r_{t + 1} + \ldots + \gamma^{k - 1} r_{t + k - 1} + \gamma^k V_\phi(s_{t + k}) - V_\phi(s_t) \right)}{\sum_{k = 1}^T \lambda^{k - 1}} \\ \end{aligned} \]
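A sketch of how this weighted average simplifies, under stated assumptions: the horizon is infinite (\(T \to \infty\)), the \(k\)-step estimate for \(k \ge 1\) carries weight \(\lambda^{k-1}\) so that the normalizer is \(1/(1-\lambda)\), and \(\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\) denotes the one-step TD error. Each \(k\)-step advantage telescopes into a sum of TD errors, and swapping the order of summation collapses the double sum:

\[ \begin{aligned} A_t^{(k)} &= \sum_{l = 0}^{k - 1} \gamma^l \delta_{t + l} \\ \hat A_t &= (1 - \lambda) \sum_{k = 1}^\infty \lambda^{k - 1} \sum_{l = 0}^{k - 1} \gamma^l \delta_{t + l} \\ &= (1 - \lambda) \sum_{l = 0}^\infty \gamma^l \delta_{t + l} \sum_{k = l + 1}^\infty \lambda^{k - 1} \\ &= \sum_{l = 0}^\infty (\gamma\lambda)^l \delta_{t + l} \end{aligned} \]

Setting \(\lambda = 0\) leaves only \(\delta_t\), the one-step advantage used by A2C. As \(\lambda \to 1\) the sum becomes the discounted return minus \(V_\phi(s_t)\), the Monte Carlo estimate used by REINFORCE with a baseline.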

In the limit \(T \to \infty\), this average can be computed recursively:

\[ \begin{aligned} \delta_t &= r_t + \gamma V_\phi(s_{t + 1}) - V_\phi(s_t) \\ \hat A_t &= \delta_t + \gamma \lambda \hat A_{t + 1} \\ \end{aligned} \]
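The recursion runs naturally as a single backward pass over a trajectory. The sketch below assumes a finite episode in which the advantage past the last step is taken to be zero, and `values` holds \(V_\phi(s_0), \ldots, V_\phi(s_T)\), one more entry than `rewards`. The numbers and the function name are illustrative.

```python
def gae(rewards, values, gamma, lam):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}, starting from A_T = 0."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]  # V(s_0..s_3); the terminal state has value 0
adv = gae(rewards, values, gamma=0.9, lam=0.95)

# The two extremes of lambda.
adv_td = gae(rewards, values, gamma=0.9, lam=0.0)  # one-step TD errors
adv_mc = gae(rewards, values, gamma=0.9, lam=1.0)  # discounted return minus V(s_t)
```

Unrolling the loop gives \(\hat A_t = \sum_{l} (\gamma\lambda)^l \delta_{t + l}\) truncated at the end of the episode, so \(\lambda\) trades the low variance of the TD estimate against the low bias of the Monte Carlo estimate.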
