
RL: Generalized Policy Iteration: Dynamic Programming vs Monte Carlo
The two major dynamic programming approaches to finding an optimal policy, given full knowledge of the MDP, are Policy Iteration and Value Iteration.
Policy Iteration: Start with an initial policy. Repeatedly apply the Bellman expectation equation to evaluate the state-values under that policy, then greedily update the policy with respect to those state-values. Alternate these two steps until the policy stops changing.
POLICY ITERATION
repeat until policy is stable:
$\quad$ repeat until convergence:
$\quad\quad$ for $s \in S$ do:
$\quad\quad\quad v(s) = \sum_{r, s'} (r + \gamma v(s'))P(r, s'|s, \pi(s))$
$\quad$ for $s \in S$ do:
$\quad\quad \pi(s) = \arg \max_{a} \sum_{r, s'} (r + \gamma v(s'))P(r, s'|s, a)$
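For concreteness, here is a minimal Python sketch of policy iteration. It assumes a hypothetical tabular MDP representation where `P[s][a]` is a list of `(probability, next_state, reward)` triples; the names and the convergence threshold `theta` are illustrative, not from any particular library.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)        # start with an arbitrary policy
    while True:
        # Policy Evaluation: sweep until the values under pi stop changing
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy Improvement: act greedily with respect to the current values
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != pi[s]:
                stable = False
            pi[s] = best
        if stable:                            # policy unchanged => optimal
            return pi, V
```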
Value Iteration: Value iteration updates the value of each state $s$ using the best action $a$ in that state, rather than the action prescribed by a fixed policy. This way, the state-values converge directly to the optimal values, and the greedy policy extracted from them is also optimal.
VALUE ITERATION
repeat until convergence:
$\quad$ for $s \in S$ do:
$\quad\quad v(s) = \max_{a} \sum_{r, s'} (r + \gamma v(s'))P(r, s'|s, a)$
for $s \in S$ do:
$\quad \pi(s) = \arg \max_{a} \sum_{r, s'} (r + \gamma v(s'))P(r, s'|s, a)$
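A minimal Python sketch of value iteration, under the same hypothetical `P[s][a] = [(probability, next_state, reward), ...]` representation used above:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # back up each state with its best action
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in range(n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # extract the greedy policy once the values have converged
    pi = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return pi, V
```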
Generalized Policy Iteration: In the approaches above, we first evaluate the state-values under a fixed policy; this step is called Policy Evaluation. We then update the policy based on the values we found; this step is called Policy Improvement. The general interplay between these two processes is called Generalized Policy Iteration, or GPI.
Drawbacks of DP: Dynamic programming has a few pitfalls:
- It requires full knowledge of the MDP.
- Since it bootstraps, the state-value estimates can be biased.
Monte Carlo Methods: When we don't have the state-transition probabilities and the reward structure of the MDP, we can learn from experience instead. We sample many trajectories and use them to estimate the state-values $v_{\pi}(s)$ and the action-values $q_{\pi}(s,a)$ under a given policy. When the policy isn't fixed, we can also adjust it based on the current estimates of $v(s)$ and $q(s,a)$.
Monte Carlo Prediction: Given a trajectory, the return is $G_{t} = R_{t} + \gamma R_{t+1} + \gamma^{2} R_{t+2} + \cdots + \gamma^{T-1-t} R_{T-1}$, which can be expressed recursively as $G_{t} = R_{t} + \gamma G_{t+1}$. So, working backwards through the episode, we can compute $G_{t}$ from $G_{t+1}$ and use it to update the value of $S_{t}$. However, a state may be visited multiple times within a trajectory, so we must decide whether to update its value only at its first occurrence in the trajectory or at every visit. This gives two variants of Monte Carlo prediction.
FIRST VISIT MONTE CARLO PREDICTION
repeat until convergence:
$\quad$ Generate an episode under $\pi$: $S_{0}, A_{0}, R_{0}, S_{1}, \cdots, S_{T-1}, A_{T-1}, R_{T-1}$
$\quad G = 0$
$\quad t = T-1$
$\quad$ while $t \geq 0$ do:
$\quad\quad G = R_{t} + \gamma G$
$\quad\quad$ If this is the first occurrence of $S_{t}$ in the episode:
$\quad\quad\quad V(S_{t}) =$ adjust$(V(S_{t}), G)$
$\quad\quad t = t - 1$
EVERY VISIT MONTE CARLO PREDICTION
repeat until convergence:
$\quad$ Generate an episode under $\pi$: $S_{0}, A_{0}, R_{0}, S_{1}, \cdots, S_{T-1}, A_{T-1}, R_{T-1}$
$\quad G = 0$
$\quad t = T-1$
$\quad$ while $t \geq 0$ do:
$\quad\quad G = R_{t} + \gamma G$
$\quad\quad V(S_{t}) =$ adjust$(V(S_{t}), G)$
$\quad\quad t = t - 1$
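The two variants differ by a single check, so they fit in one Python sketch. It assumes a hypothetical `generate_episode(policy)` helper that returns a list of `(state, action, reward)` tuples, and uses an incremental running mean as the "adjust" update:

```python
from collections import defaultdict

def mc_prediction(generate_episode, policy, n_episodes, gamma=0.99,
                  first_visit=True):
    V = defaultdict(float)        # state-value estimates
    counts = defaultdict(int)     # number of returns averaged per state
    for _ in range(n_episodes):
        episode = generate_episode(policy)          # [(S_t, A_t, R_t), ...]
        states = [s for s, _, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # walk the episode backwards
            s, _, r = episode[t]
            G = r + gamma * G                       # G_t = R_t + gamma * G_{t+1}
            if first_visit and s in states[:t]:
                continue                            # not the first visit to s
            counts[s] += 1
            V[s] += (G - V[s]) / counts[s]          # incremental mean as "adjust"
    return V
```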
Monte Carlo Control: MC methods can also be used for control by updating the policy after each episode and generating the next episode under the updated policy. One such approach is MC Exploring Starts. It combines first-visit MC with Generalized Policy Iteration: instead of estimating the state-value $v(s)$, we estimate $q(s,a)$ and update the policy greedily with respect to it.
MC EXPLORING STARTS
repeat until convergence:
$\quad$ Randomly choose a starting state-action pair $S_{0}, A_{0}$ (exploring start)
$\quad$ Generate the rest of the episode under $\pi$: $S_{0}, A_{0}, R_{0}, S_{1}, \cdots, S_{T-1}, A_{T-1}, R_{T-1}$
$\quad G = 0$
$\quad t = T-1$
$\quad$ while $t \geq 0$ do:
$\quad\quad G = R_{t} + \gamma G$
$\quad\quad$ If this is the first occurrence of $(S_{t}, A_{t})$ in the episode:
$\quad\quad\quad q(S_{t}, A_{t}) =$ adjust$(q(S_{t}, A_{t}), G)$
$\quad\quad\quad \pi(S_{t}) = \arg \max_{a} q(S_{t}, a)$
$\quad\quad t = t - 1$
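A minimal Python sketch of MC control with exploring starts. The helper `generate_episode_es(pi)` is a hypothetical function that picks a uniformly random start state and start action, then follows $\pi$, returning `(state, action, reward)` tuples; the running-mean update again plays the role of "adjust":

```python
from collections import defaultdict
import random

def mc_exploring_starts(generate_episode_es, n_actions, n_episodes, gamma=0.99):
    Q = defaultdict(float)                      # action-value estimates q(s, a)
    counts = defaultdict(int)
    pi = defaultdict(lambda: random.randrange(n_actions))  # arbitrary initial policy
    for _ in range(n_episodes):
        episode = generate_episode_es(pi)       # exploring start, then follow pi
        pairs = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G
            if (s, a) in pairs[:t]:
                continue                        # only the first visit to (s, a)
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
            # greedy policy improvement at the visited state
            pi[s] = max(range(n_actions), key=lambda b: Q[(s, b)])
    return pi, Q
```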
Drawbacks of Monte Carlo: A few drawbacks are:
- The bias is lower than with bootstrapping, but the value estimates suffer from high variance because the sampled trajectories can be very diverse.
- The updated policy can't be used until the end of the episode, so learning happens only episode by episode.