Proving the Policy Gradient Theorem

Policy Gradient Theorem

The objective to maximize is the value of the start state $s_0$ under the parameterized policy $\pi_\theta$:

$$
J(\theta)\doteq v_{\pi_\theta}(s_0)
$$

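with discount factor $\gamma=1$ (the undiscounted, episodic case). Below, $\pi$ abbreviates $\pi_\theta$ and $\nabla$ denotes the gradient with respect to $\theta$.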

Writing $v_\pi(s_0)$ as an expectation of $q_\pi$ over actions and applying the product rule gives the gradient:

$$
\begin{aligned}
\nabla J(\theta)&=\nabla\left[\sum_{a_0}\pi(a_0|s_0)q_\pi(s_0,a_0)\right]\\
&=\sum_{a_0}\left[\nabla\pi(a_0|s_0)q_\pi(s_0,a_0)+\pi(a_0|s_0)\nabla q_\pi(s_0,a_0)\right]\\
&=\sum_{a_0}\nabla\pi(a_0|s_0)q_\pi(s_0,a_0)+\sum_{a_0}\pi(a_0|s_0)\nabla q_\pi(s_0,a_0)
\end{aligned}
$$

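For $\nabla q_\pi(s_0,a_0)$, the dynamics $p$ and the reward $r_1$ do not depend on $\theta$, so only the $v_\pi(s_1)$ term contributes:

$$
\begin{aligned}
\nabla q_\pi(s_0,a_0)&=\nabla\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\left(r_1+v_\pi(s_1)\right)\\
&=\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\nabla v_\pi(s_1)\\
&=\sum_{s_1}p(s_1|s_0,a_0)\nabla v_\pi(s_1)
\end{aligned}
$$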

For $\nabla v_\pi(s_1)$, applying the same two steps at $s_1$ gives:

$$
\begin{aligned}
\nabla v_\pi(s_1)&=\sum_{a_1}\nabla\pi(a_1|s_1)q_\pi(s_1,a_1)+\sum_{a_1}\pi(a_1|s_1)\sum_{s_2}p(s_2|s_1,a_1)\nabla v_\pi(s_2)
\end{aligned}
$$

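Let $\Pr(s_0\to x,t,\pi)$ be the probability of being in state $x$ after $t$ steps when starting from $s_0$ and following the policy $\pi$. In particular, $\Pr(s_0\to x,0,\pi)$ is $1$ for $x=s_0$ and $0$ otherwise, and $\Pr(s_0\to s_1,1,\pi)=\sum_{a_0}\pi(a_0|s_0)p(s_1|s_0,a_0)$.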
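To make the $t$-step probabilities concrete, here is a minimal sketch that computes them by pushing the start distribution forward one step at a time. The toy MDP `P`, the tabular softmax policy, and every function name are assumptions made up for illustration; none of them come from the derivation itself.

```python
import math

# Hypothetical toy MDP (an assumption, not from the text): states 0 and 1 are
# non-terminal, state 2 is terminal. P[(s, a)] lists (next_state, reward, prob).
P = {
    (0, 0): [(1, 1.0, 0.8), (2, 0.0, 0.2)],
    (0, 1): [(0, 0.5, 0.5), (2, 2.0, 0.5)],
    (1, 0): [(0, 0.0, 0.3), (2, 3.0, 0.7)],
    (1, 1): [(1, 1.0, 0.4), (2, 0.0, 0.6)],
}
STATES, ACTIONS, TERMINAL = (0, 1), (0, 1), 2


def softmax_policy(theta, s):
    """pi(.|s) for a tabular softmax policy with preferences theta[s][a]."""
    z = [math.exp(theta[s][a]) for a in ACTIONS]
    total = sum(z)
    return [x / total for x in z]


def visit_probs(theta, s0, horizon):
    """Return d with d[t][x] = Pr(s0 -> x, t, pi) for non-terminal x."""
    d = [[1.0 if x == s0 else 0.0 for x in STATES]]  # t = 0: all mass on s0
    for _ in range(horizon):
        nxt = [0.0 for _ in STATES]
        for x in STATES:
            pi = softmax_policy(theta, x)
            for a in ACTIONS:
                for s_next, _r, p in P[(x, a)]:
                    if s_next != TERMINAL:
                        # one more step: weight by pi(a|x) * p(s_next|x, a)
                        nxt[s_next] += d[-1][x] * pi[a] * p
        d.append(nxt)
    return d


theta = [[0.0, 0.0], [0.0, 0.0]]  # all-zero preferences give a uniform policy
d = visit_probs(theta, s0=0, horizon=3)
for t, row in enumerate(d):
    print(t, row)
```

The rows sum to less than one for $t\ge 1$ because some probability mass has already reached the terminal state.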

Unrolling the recursion and grouping the terms by time step gives:

$$
\begin{aligned}
\nabla J(\theta)&=\sum_{a_0}\nabla\pi(a_0|s_0)q_\pi(s_0,a_0)+\sum_{a_0}\pi(a_0|s_0)\cdot\sum_{s_1}p(s_1|s_0,a_0)\left[\sum_{a_1}\nabla\pi(a_1|s_1)q_\pi(s_1,a_1)+\cdots\right]\\
&=\sum_{s_0}\Pr(s_0\to s_0,0,\pi)\sum_{a_0}\nabla\pi(a_0|s_0)q_\pi(s_0,a_0)\\
&+\sum_{s_1}\Pr(s_0\to s_1,1,\pi)\sum_{a_1}\nabla\pi(a_1|s_1)q_\pi(s_1,a_1)+\cdots\\
&=\sum_{s_0}\Pr(s_0\to s_0,0,\pi)\sum_{a_0}\pi(a_0|s_0)q_\pi(s_0,a_0)\dfrac{\nabla\pi(a_0|s_0)}{\pi(a_0|s_0)}\\
&+\sum_{s_1}\Pr(s_0\to s_1,1,\pi)\sum_{a_1}\pi(a_1|s_1)q_\pi(s_1,a_1)\dfrac{\nabla\pi(a_1|s_1)}{\pi(a_1|s_1)}+\cdots\\
&=\sum_{s_0}\Pr(s_0\to s_0,0,\pi)\sum_{a_0}\pi(a_0|s_0)q_\pi(s_0,a_0)\nabla\ln\pi(a_0|s_0)\\
&+\sum_{s_1}\Pr(s_0\to s_1,1,\pi)\sum_{a_1}\pi(a_1|s_1)q_\pi(s_1,a_1)\nabla\ln\pi(a_1|s_1)+\cdots\\
&=\sum_{t=0}^\infty\sum_{s_t}\Pr(s_0\to s_t,t,\pi)\sum_{a_t}\pi(a_t|s_t)q_\pi(s_t,a_t)\nabla\ln\pi(a_t|s_t)
\end{aligned}
$$
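As a sanity check on the final line of this derivation, the sketch below evaluates the triple sum on a small example and compares it with a central finite-difference estimate of $\nabla J(\theta)$. Everything here is an assumption chosen for illustration: the toy MDP `P`, the tabular softmax policy (for which $\partial\ln\pi(a|s)/\partial\theta_{s,b}=\mathbb{1}[a=b]-\pi(b|s)$), the truncation horizon, and all function names.

```python
import math

# Hypothetical toy MDP (an assumption): states 0, 1 non-terminal, 2 terminal.
# P[(s, a)] lists (next_state, reward, probability); gamma = 1 throughout.
P = {
    (0, 0): [(1, 1.0, 0.8), (2, 0.0, 0.2)],
    (0, 1): [(0, 0.5, 0.5), (2, 2.0, 0.5)],
    (1, 0): [(0, 0.0, 0.3), (2, 3.0, 0.7)],
    (1, 1): [(1, 1.0, 0.4), (2, 0.0, 0.6)],
}
STATES, ACTIONS, TERMINAL = (0, 1), (0, 1), 2


def softmax_policy(theta, s):
    z = [math.exp(theta[s][a]) for a in ACTIONS]
    total = sum(z)
    return [x / total for x in z]


def evaluate(theta, sweeps=500):
    """Iterative policy evaluation; returns (v, q) with v[TERMINAL] = 0."""
    v = [0.0, 0.0, 0.0]
    q = {}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                q[(s, a)] = sum(p * (r + v[s2]) for s2, r, p in P[(s, a)])
        v = [sum(pa * q[(s, a)] for a, pa in zip(ACTIONS, softmax_policy(theta, s)))
             for s in STATES] + [0.0]
    return v, q


def theorem_gradient(theta, s0, horizon=400):
    """sum_t sum_s Pr(s0->s,t,pi) sum_a pi(a|s) q(s,a) grad ln pi(a|s), truncated."""
    _v, q = evaluate(theta)
    grad = [[0.0 for _ in ACTIONS] for _ in STATES]
    d = [1.0 if x == s0 else 0.0 for x in STATES]  # Pr(s0 -> x, 0, pi)
    for _ in range(horizon):
        nxt = [0.0 for _ in STATES]
        for s in STATES:
            pi = softmax_policy(theta, s)
            for a in ACTIONS:
                for b in ACTIONS:
                    # softmax: d ln pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
                    glog = (1.0 if a == b else 0.0) - pi[b]
                    grad[s][b] += d[s] * pi[a] * q[(s, a)] * glog
                for s2, _r, p in P[(s, a)]:
                    if s2 != TERMINAL:
                        nxt[s2] += d[s] * pi[a] * p
        d = nxt
    return grad


def finite_difference_gradient(theta, s0, eps=1e-5):
    """Central differences on J(theta) = v(s0)."""
    grad = [[0.0 for _ in ACTIONS] for _ in STATES]
    for s in STATES:
        for b in ACTIONS:
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[s][b] += eps
            minus[s][b] -= eps
            grad[s][b] = (evaluate(plus)[0][s0] - evaluate(minus)[0][s0]) / (2 * eps)
    return grad


theta = [[0.3, -0.2], [0.1, 0.4]]
g_thm = theorem_gradient(theta, s0=0)
g_fd = finite_difference_gradient(theta, s0=0)
max_gap = max(abs(g_thm[s][b] - g_fd[s][b]) for s in STATES for b in ACTIONS)
print("theorem:          ", g_thm)
print("finite difference:", g_fd)
print("max abs gap:      ", max_gap)
```

The two gradients should agree up to finite-difference error. Truncating the sum over $t$ is harmless in this example because every step terminates with probability at least $0.2$, so the visit probabilities decay geometrically.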

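This is the base form of the policy gradient theorem, from which other forms follow. Define $\eta(x)\doteq\sum_{t=0}^\infty\Pr(s_0\to x,t,\pi)$, the expected number of visits to $x$ in an episode, and $\mu(x)\doteq\eta(x)/\sum_{x'}\eta(x')$, the on-policy state distribution. Then:

$$
\begin{aligned}
\nabla J(\theta)&=\sum_{t=0}^\infty\sum_{s_t}\Pr(s_0\to s_t,t,\pi)\sum_{a_t}\pi(a_t|s_t)q_\pi(s_t,a_t)\nabla\ln\pi(a_t|s_t)\\
&=\sum_{x\in\mathcal S}\sum_{t=0}^\infty\Pr(s_0\to x,t,\pi)\sum_a\pi(a|x)q_\pi(x,a)\nabla\ln\pi(a|x)\\
&=\sum_{x\in\mathcal S}\eta(x)\sum_a\pi(a|x)q_\pi(x,a)\nabla\ln\pi(a|x)\\
&=\left(\sum_{x'\in\mathcal S}\eta(x')\right)\sum_{x\in\mathcal S}\mu(x)\sum_a\pi(a|x)q_\pi(x,a)\nabla\ln\pi(a|x)\\
&\propto\mathop{\mathbb{E}}\limits_{s\sim\mu,\,a\sim\pi(a|s)}\left[q_\pi(s,a)\nabla\ln\pi(a|s)\right]
\end{aligned}
$$

The constant of proportionality $\sum_{x'}\eta(x')$ is the expected episode length; it does not change the direction of the gradient.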
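The state weighting in this last step can be checked numerically. The sketch below sums $\Pr(s_0\to x,t,\pi)$ over $t$ to get an expected visit count per state (called `eta` here, a name chosen for this sketch), normalizes it into a distribution `mu`, and verifies that `eta` satisfies the one-step consistency relation $\eta(y)=\mathbb{1}[y=s_0]+\sum_x\eta(x)\sum_a\pi(a|x)p(y|x,a)$. The toy MDP, the softmax policy, and all names are illustrative assumptions.

```python
import math

# Hypothetical toy MDP (an assumption): states 0, 1 non-terminal, 2 terminal.
P = {
    (0, 0): [(1, 1.0, 0.8), (2, 0.0, 0.2)],
    (0, 1): [(0, 0.5, 0.5), (2, 2.0, 0.5)],
    (1, 0): [(0, 0.0, 0.3), (2, 3.0, 0.7)],
    (1, 1): [(1, 1.0, 0.4), (2, 0.0, 0.6)],
}
STATES, ACTIONS, TERMINAL = (0, 1), (0, 1), 2


def softmax_policy(theta, s):
    z = [math.exp(theta[s][a]) for a in ACTIONS]
    total = sum(z)
    return [x / total for x in z]


def state_transition(theta):
    """M[x][y] = sum_a pi(a|x) p(y|x,a), restricted to non-terminal states."""
    M = [[0.0 for _ in STATES] for _ in STATES]
    for x in STATES:
        pi = softmax_policy(theta, x)
        for a in ACTIONS:
            for y, _r, p in P[(x, a)]:
                if y != TERMINAL:
                    M[x][y] += pi[a] * p
    return M


def expected_visits(theta, s0, horizon=400):
    """eta[x] = sum over t < horizon of Pr(s0 -> x, t, pi)."""
    M = state_transition(theta)
    d = [1.0 if x == s0 else 0.0 for x in STATES]
    eta = [0.0 for _ in STATES]
    for _ in range(horizon):
        eta = [e + dx for e, dx in zip(eta, d)]
        d = [sum(d[x] * M[x][y] for x in STATES) for y in STATES]
    return eta, M


theta = [[0.3, -0.2], [0.1, 0.4]]
s0 = 0
eta, M = expected_visits(theta, s0)
total = sum(eta)                 # expected number of steps before termination
mu = [e / total for e in eta]    # normalized state distribution

# Consistency check: eta(y) = 1[y == s0] + sum_x eta(x) * M[x][y]
residual = max(
    abs(eta[y] - ((1.0 if y == s0 else 0.0) + sum(eta[x] * M[x][y] for x in STATES)))
    for y in STATES
)
print("eta:", eta)
print("mu: ", mu)
print("expected episode length:", total)
print("residual:", residual)
```

Here `total` plays the role of the constant that separates the visit-count form of the gradient from the expectation form, which is why the expectation form is usually stated only up to proportionality.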


Copyright © 2023 - 2024 Charley's Hut All Rights Reserved.
