Suppose the state S_t = s_t is given.
Let \tau = (S_0,A_0,S_1,A_1,\cdots). We call \tau a trajectory of this game.
Fix any t. We measure how good the action A_t is by the expected return \mathbf{E}\bigl[ G_t \bigr], where \begin{aligned} G_t &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \cr &= \sum_{k=t}^\infty \gamma^{k-t} R_k\bigl(S_k(\theta),A_k(\theta)\bigr). \end{aligned} Here \gamma is the discount factor. Our goal is to maximize \sum_{t} \mathbf{E}\bigl[ G_t \bigr].
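As a small numerical illustration of the return G_t, the sketch below computes the discounted sum for every t from a finite list of sampled rewards. Truncating the infinite sum at the end of the list is an assumption (it is exact only if all later rewards are zero), and the name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] with G_t = sum_{k>=t} gamma^(k-t) * rewards[k].

    Assumption: rewards after the end of the list are zero, so the
    infinite sum in the definition is truncated at T = len(rewards).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using the recursion G_t = R_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([1.0, 1.0, 1.0], 0.5)` gives `[1.75, 1.5, 1.0]`. The backward recursion avoids recomputing powers of \gamma for each t.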
Suppose we have already played through time t-1. That is, we have already sampled the following: \begin{aligned} s_0 &\sim p(s_0) , & a_0 &\sim \pi_{\theta} (a_0\vert s_0) , \cr s_1 &\sim p(s_1\vert s_0,a_0) , & a_1 &\sim \pi_\theta(a_1\vert s_1), \cr &\vdots & &\vdots \cr s_{t-1} &\sim p(s_{t-1}\vert s_{t-2},a_{t-2}) , & a_{t-1} &\sim \pi_\theta(a_{t-1}\vert s_{t-1}). \end{aligned} We measure how good the action A_t is by the return \begin{aligned} G_t &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \cr &= \sum_{k=t}^\infty \gamma^{k-t} R_k\bigl(S_k(\theta),A_k(\theta)\bigr). \end{aligned} Our goal is to maximize \mathbf{E}_{\theta} \bigl[ G_t \bigr], where, by linearity of expectation (assuming the series converges), \begin{aligned} \mathbf{E}_{\theta} \bigl[ G_t \bigr] = \sum_{k=t}^\infty \gamma^{k-t} \, \mathbf{E}_{\theta} \bigl[ R_k \bigr]. \end{aligned}
\begin{aligned} \mathbf{E} \bigl[ R_t \bigr] &= \sum_{a\in A} R_t(s_t,a) \cdot \mathbf{P} \bigl[ A_t(\theta) = a \bigr] \cr &= \sum_{a\in A} R_t(s_t,a) \cdot \pi_\theta(a \vert s_t) \end{aligned}
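The formula for the expected immediate reward is a weighted sum over actions, which can be checked numerically. The sketch below assumes a softmax policy over a finite action set, with one parameter per action in the given state s_t. This parameterization and the names `softmax` and `expected_reward` are illustrative assumptions, not something fixed by the text.

```python
import math

def softmax(prefs):
    """Turn a list of real-valued preferences into probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(rewards_at_s, theta_at_s):
    """Compute sum_a R_t(s_t, a) * pi_theta(a | s_t).

    rewards_at_s[a] : R_t(s_t, a) for each action a
    theta_at_s[a]   : policy parameter for action a in state s_t
                      (assumed softmax parameterization)
    """
    probs = softmax(theta_at_s)
    return sum(r * p for r, p in zip(rewards_at_s, probs))
```

With equal parameters the policy is uniform, so `expected_reward([1.0, 3.0], [0.0, 0.0])` is the plain average of the two rewards.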
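For k\geq t, \begin{aligned} \mathbf{E}_\theta \bigl[ R_k \bigr] = \sum_{s} \sum_{a\in A} R_k(s,a) \cdot \pi_\theta(a \vert s) \cdot \mathbf{P} \bigl[ S_k(\theta) = s \bigr], \end{aligned} where s ranges over all states. For k = t, with S_t = s_t given, only the term s = s_t remains.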
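\begin{aligned} \mathbf{E} \bigl[ R_n \big\vert \text{given} \bigr] = \mathbf{E} \bigl[ R_n \bigl( \underline{S_n(\theta),A_n(\theta)} \bigr) \big\vert \text{given} \bigr] = \sum_{s} \sum_{a\in A} R_n(s,a) \cdot \mathbf{P} \bigl[ S_n(\theta) = s, A_n(\theta) = a \big\vert \text{given} \bigr] \end{aligned}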
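The expectation of a later reward, given the current state, can be computed exactly in a small tabular setting: push the state distribution forward one step at a time, then average the reward over states and actions. The sketch below assumes finite state and action sets, a time-independent reward table `R[s][a]`, and nested-list tables `P[s][a][s2]` and `pi[s][a]`. These names and representations are hypothetical.

```python
def expected_future_reward(P, R, pi, s_t, steps):
    """Compute E[R_{t+steps} | S_t = s_t] for a tabular model.

    P[s][a][s2] : transition probability p(s2 | s, a)
    R[s][a]     : reward for action a in state s (assumed time-independent)
    pi[s][a]    : policy probability pi_theta(a | s)
    """
    n_states = len(P)
    # Distribution of S_t: all mass on the given state.
    dist = [0.0] * n_states
    dist[s_t] = 1.0
    # Push the state distribution forward `steps` times.
    for _ in range(steps):
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a, pa in enumerate(pi[s]):
                for s2 in range(n_states):
                    nxt[s2] += dist[s] * pa * P[s][a][s2]
        dist = nxt
    # Average the reward over the final state and the action drawn there.
    return sum(dist[s] * pa * R[s][a]
               for s in range(n_states)
               for a, pa in enumerate(pi[s]))
```

With `steps = 0` this reduces to the single-state sum over actions. Exact enumeration like this is only practical for small models; for larger ones the same quantity would typically be estimated by sampling trajectories.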