Monte Carlo Methods for Solving Reinforcement Learning Problems
We continue our deep dive into Sutton's great book on RL [1], focusing here on Monte Carlo (MC) methods. These can learn from experience alone, i.e. they do not require any kind of model of the environment, as is needed e.g. by the Dynamic Programming (DP) methods we introduced in the previous post.
This is extremely appealing: often the model is not known, or the transition probabilities are hard to specify. Consider the game of Blackjack: even though we fully understand the game and its rules, solving it via DP methods would be very tedious – we would have to compute all kinds of probabilities, e.g. given the currently played cards, how likely is a "blackjack", how likely is it that another seven is dealt … With MC methods, we don't have to deal with any of this: we simply play and learn from experience.

Because they estimate values by averaging complete sampled returns rather than bootstrapping from other estimates, MC methods are unbiased. They are conceptually simple and easy to understand, but they exhibit high variance and do not bootstrap, i.e. they cannot reuse other value estimates for incremental updates within an episode.
As mentioned, we will introduce these methods following Chapter 5 of Sutton's book. It is at this point (at least in my opinion) that the difficulty picks up, but mastering this chapter (and the following ones) really rewards you with a better understanding of RL.
This post is structured as follows: we begin by introducing MC methods and, similar to the previous post, the "prediction" problem. Next, we move to the "control" problem. We show a first MC control algorithm building on two (unlikely) assumptions: that policy evaluation runs over infinitely many episodes, and that every state-action pair is visited infinitely often because episodes start in random state-action pairs (exploring starts).
The remainder of this post is about removing these assumptions: while the first can be disposed of rather quickly, the latter takes some more consideration. We first present an on-policy method in which the learned policy remains ε-greedy (and thus only near-optimal), and then move to off-policy methods, which involve importance sampling.
Thanks for tuning in, and, if you like this post and others, please consider leaving a clap or comment, and consider following me on Medium!
As usual, all code is available on GitHub.
Monte Carlo Prediction
As mentioned, we begin with the prediction problem – that is, estimating the value function for a given policy. The method described here will be very similar to the one used to solve the multi-armed bandit problem in Post 1 – the only difference being that here we have multiple states.
The core idea is to sample episodes, then collect and average the returns following each state to compute the value function – in the limit, this average matches the expected value. This principle is common to all MC methods and gave rise to their name (a nod to the famous casino in Monte Carlo, Monaco).
We can define two slightly different ways of computing this average: first-visit MC and every-visit MC. When generating an episode, the same state may be visited more than once. First-visit MC estimates a state's value by averaging only the returns following its first occurrence in each episode, while every-visit MC averages the returns following all visits. These methods have slightly different theoretical properties, but we won't dwell on that here and instead focus only on first-visit MC. In pseudocode this looks as follows:

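In Python, a minimal sketch of first-visit MC prediction could look like the following. This is an illustrative sketch, not the exact pseudocode from the book; sample_episode is a hypothetical helper assumed to return a list of (state, reward) tuples generated by following the given policy.
from collections import defaultdict

def first_visit_mc_prediction(sample_episode, pi, gamma, num_episodes=1000):
    """Estimate V(s) for policy `pi` by averaging first-visit returns.

    `sample_episode(pi)` is assumed to return a list of (state, reward)
    tuples, where reward is the reward received after leaving that state.
    """
    returns = defaultdict(list)  # state -> all first-visit returns seen so far
    V = defaultdict(float)       # state -> current value estimate
    for _ in range(num_episodes):
        episode = sample_episode(pi)
        G = 0.0
        # Walk backwards through the episode, accumulating the discounted return
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: only record G if s does not occur earlier
            if s not in [s_ for s_, _ in episode[:t]]:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V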
Now this is actually not ideal in this setting: in model-based approaches, when estimating value functions, for any state we can simply evaluate all actions and then choose the action leading to the highest-valued successor state. Here, we do not have such a model, so it is much more appealing to estimate the value of state-action pairs, i.e. the action-value function. In general this can be done exactly as shown above for the value function, with only one complication: it is likely that several state-action pairs are never visited, and without any returns to average, their value estimates will not improve.
This relates to the exploration-exploitation trade-off (introduced in Part 1) and to the need to keep exploring continuously. One way to guarantee this is to start each episode in a randomly chosen state-action combination, with every pair having non-zero probability of being selected. This is called the exploring starts assumption. We'll cover its implementation together with the next section, MC control.
Monte Carlo Control
As in the post about Dynamic Programming, we'll use generalized policy iteration (GPI) – iterative steps of policy evaluation and policy improvement.
There is nothing special about this process here per se – except that we have to make two (unlikely) assumptions for it to work: we need the exploring starts assumption, and we require that the inner loop of policy evaluation runs for infinitely many steps.
The latter we can remove relatively easily, analogously to what we described in the DP post: it is fine to break the policy evaluation loop early and only move the policy's value estimate closer to the optimum instead of reaching it – the extreme case being value iteration, as shown in that post.
For now, we will keep the exploring starts assumption and show one control algorithm using it – and in the last part of the post we'll show how to remove this assumption, too.
I mentioned at the start of this post that this is where the difficulty starts picking up theoretically – but implementing these algorithms in practice also took me a few tries, and I found some little "tricks and tweaks" to make them work, which I will explain while introducing the respective algorithms.
Monte Carlo Control with Exploring Starts
This is what the pseudocode for MC control with ES looks like:

Now let's translate this to Python code:
The first difficulty we encounter is how to satisfy the ES assumption: gymnasium [2], and RL environments in general, are not designed for jumping to arbitrary states. Instead, you typically initialize the environment and then act from that starting state (one problem, for example, is history: if you jump to a random state, how would you generate potentially needed historic frames?).
Maybe it would have been possible to "hack" gymnasium, or to implement custom environments that allow generating random states (please let me know in the comments if you know of other / better ways!), but I decided to implement a function that generates several possible states of an environment, and then to sample from this list.
This function is called many times (once for every generated episode), so we want to cache the expensive generation process. However, the available functools.lru_cache decorator, for example, has issues determining equality for gymnasium environments – so I implemented a custom caching decorator that ignores the first argument (the environment); another option would have been to override the __eq__ method of the environments:
def call_once(func):
    """Custom cache decorator
    ignoring the first argument.
    """
    cache = {}

    def wrapper(*args, **kwargs):
        key = (func.__name__, args[1:])
        assert not kwargs, "We don't support kwargs atm"
        if key not in cache:
            cache[key] = func(*args)
        return cache[key]

    return wrapper
@call_once
def generate_possible_states(
    env: ParametrizedEnv, num_runs: int = 100
) -> list[tuple[int, ParametrizedEnv]]:
    """Generate possible states of an environment.

    For this, iteratively increase the set of already known
    states by picking a random state and following a random
    policy from then on, noting down new states.

    Args:
        env: environment to use
        num_runs: number of discovery loops

    Returns:
        list containing found states - which are tuples of states (observations)
        and the gym environment representing that state
    """
    _, action_space = get_observation_action_space(env)
    observation, _ = env.env.reset()
    possible_states = [(observation, copy.deepcopy(env))]
    for _ in range(num_runs):
        observation, env = random.choice(possible_states)
        env = copy.deepcopy(env)
        terminated = truncated = False
        while not terminated and not truncated:
            action = np.random.choice([a for a in range(action_space.n)])
            observation, _, terminated, truncated, _ = env.env.step(action)
            if observation in set(state for state, _ in possible_states):
                break
            else:
                if not terminated and not truncated:
                    # Store a snapshot of the env, so the stored copy still
                    # matches this observation as we keep stepping
                    possible_states.append((observation, copy.deepcopy(env)))
    return possible_states
def generate_random_start(env: ParametrizedEnv) -> tuple[int, ParametrizedEnv]:
    """Pick a random starting state.

    For that, first generate all possible
    states (cached), then pick a random
    state from these.
    """
    possible_states = generate_possible_states(env)
    observation, env = random.choice(possible_states)
    return copy.deepcopy(observation), copy.deepcopy(env)
As you can see, generate_possible_states iteratively builds up a set of possible states: we first select a state from the already known states, and then follow random actions while noting down all new states.
Further, we implement a helper function to generate episodes following a specific policy:
def generate_episode(
    env: ParametrizedEnv,
    pi: np.ndarray,
    exploring_starts: bool,
    max_episode_length: int = 20,
) -> list[tuple[Any, Any, Any]]:
    """Generate an episode following the given policy.

    Args:
        env: environment to use
        pi: policy to follow
        exploring_starts: whether to use the exploring starts (ES) assumption
        max_episode_length: maximum number of steps before truncating the episode

    Returns:
        generated episode
    """
    _, action_space = get_observation_action_space(env)

    episode = []
    observation, _ = env.env.reset()
    if exploring_starts:
        # Pick random starting state if ES
        observation, env = generate_random_start(env)

    terminated = truncated = False
    initial_step = True

    while not terminated and not truncated:
        if initial_step and exploring_starts:
            # Pick random action initially if ES
            action = np.random.choice([a for a in range(action_space.n)])
        else:
            action = np.random.choice(
                [a for a in range(action_space.n)], p=pi[observation]
            )
        initial_step = False
        observation_new, reward, terminated, truncated, _ = env.env.step(action)
        episode.append((observation, action, reward))
        # Terminate episodes in which agent is stuck
        if len(episode) > max_episode_length:
            break
        observation = observation_new

    return episode
When called with ES, the first action is selected completely at random – otherwise we sample actions according to the policy distribution in every step. It is important to terminate "stuck" episodes early: since in MC methods we play out full episodes and only update our policy afterwards, it can happen that the policy is stuck on some non-ideal action leading to cyclic state visits (e.g. turning left in the top-left corner – an action without any effect). To avoid waiting indefinitely (or until the environment truncates), we break after a certain number of steps.
With this we can finally implement the MC ES algorithm introduced above:
def mc_es(env: ParametrizedEnv) -> np.ndarray:
    """Solve passed Gymnasium env via Monte Carlo
    with exploring starts.

    Args:
        env: env containing the problem

    Returns:
        found policy
    """
    observation_space, action_space = get_observation_action_space(env)
    pi = (
        np.ones((observation_space.n, action_space.n)).astype(np.int32) / action_space.n
    )
    Q = np.zeros((observation_space.n, action_space.n))
    returns = defaultdict(list)

    num_steps = 1000
    for _ in range(num_steps):
        episode = generate_episode(env, pi, True)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = env.gamma * G + r
            prev_sa = [(s_, a_) for (s_, a_, _) in episode[:t]]
            if (s, a) not in prev_sa:
                returns[s, a].append(G)
                Q[s, a] = statistics.fmean(returns[s, a])
                if not all(Q[s, a] == Q[s, 0] for a in range(action_space.n)):
                    for a in range(action_space.n):
                        pi[s, a] = 1 if a == np.argmax(Q[s]) else 0

    return np.argmax(pi, 1)
The mapping from the pseudocode above to Python should be relatively straightforward – but one important addition I had to make is the equality check before updating the policy: it can happen that no non-zero reward is observed when generating an episode, and the policy update would then prematurely favor some arbitrary action without any real information.
Monte Carlo Control without Exploring Starts
Above we saw a first MC control method relying on the ES assumption. But we have also seen first-hand how clunky this is to work with – so in the remainder of this post we'll introduce two methods that work without this assumption.
In general, for this to work we have to ensure that all state-action pairs continue to be visited (*), and there are two classes of methods which help us realize this: on-policy and off-policy methods. On-policy methods improve the same policy that generates the data, while off-policy methods improve a target policy while generating data from a different, behavior policy.
Here we start with an on-policy method and later introduce an off-policy one. All the methods we have considered so far were on-policy methods, as there was only a single policy – the one we used to generate data, and which we also optimized directly. To satisfy (*) in on-policy methods, the idea is to keep the policy "soft" instead of making the "hard" updates we did above. That is, when updating the policy towards the action with maximal Q-value, we don't go all the way to a deterministic policy, but instead only move towards it while maintaining

π(a | s) ≥ ε / |A(s)|

for all states and actions.
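To make this "soft" update concrete, here is a small sketch (using numpy, mirroring but not identical to the implementation below) of how an ε-greedy distribution over actions can be built from a single row of Q-values:
import numpy as np

def eps_greedy_distribution(q_row: np.ndarray, eps: float) -> np.ndarray:
    """Return an eps-greedy action distribution for a single state.

    Every action keeps at least probability eps / num_actions;
    the greedy action receives the remaining probability mass.
    """
    num_actions = len(q_row)
    probs = np.full(num_actions, eps / num_actions)
    probs[np.argmax(q_row)] += 1.0 - eps
    return probs

# Example: eps = 0.1 and 4 actions -> greedy action gets 0.925, all others 0.025
print(eps_greedy_distribution(np.array([0.1, 0.0, 0.7, 0.2]), eps=0.1))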
Apart from that, there is not much difference to the MC ES algorithm introduced above, and the similarities in structure should be obvious:

Same for the Python implementation:
def on_policy_mc(env: ParametrizedEnv) -> np.ndarray:
    """Solve passed Gymnasium env via on-policy Monte
    Carlo control.

    Args:
        env: env containing the problem

    Returns:
        found policy
    """
    observation_space, action_space = get_observation_action_space(env)
    pi = (
        np.ones((observation_space.n, action_space.n)).astype(np.int32) / action_space.n
    )
    Q = np.zeros((observation_space.n, action_space.n))
    returns = defaultdict(list)

    num_steps = 1000
    for _ in range(num_steps):
        episode = generate_episode(env, pi, False)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = env.gamma * G + r
            prev_sa = [(s_, a_) for (s_, a_, _) in episode[:t]]
            if (s, a) not in prev_sa:
                returns[s, a].append(G)
                Q[s, a] = statistics.fmean(returns[s, a])
                if not all(Q[s, a] == Q[s, 0] for a in range(action_space.n)):
                    A_star = np.argmax(Q[s, :])
                    for a in range(action_space.n):
                        pi[s, a] = (
                            1 - env.eps + env.eps / action_space.n
                            if a == A_star
                            else env.eps / action_space.n
                        )

    return np.argmax(pi, 1)
Off-Policy Monte Carlo Control
Introducing off-policy methods will take a bit longer. They arise from a compromise we make in on-policy methods: we would like to learn an optimal policy, but still have to keep exploring. And, in fact, the on-policy methods presented so far solve exactly for such a compromise: they learn a near-optimal policy that still explores. It seems much more natural to use two policies: to optimize a target policy, while using a second policy to explore and generate data.
Due to this, off-policy methods are actually more general and powerful (and some of the most famous RL algorithms, such as Q-Learning, are off-policy methods) – but often have higher variance and are slower to converge.
Common to most off-policy methods is the use of importance sampling: since we want to compute expectations under the target policy, but have actually sampled from a different policy, the naive estimate is off – biased. Importance sampling re-weights the observed values such that the resulting estimate is correct again. Due to space constraints I refrain from a more detailed introduction here, but refer the reader to the linked post or any other source for more.
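As a quick standalone illustration (a toy example, not specific to RL), the following sketch estimates a mean under a target distribution π while only sampling from a behavior distribution b, correcting via the ratio of the two densities:
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: estimate E_pi[X] where pi = N(1, 1),
# but we can only sample from the behavior distribution b = N(0, 1).
def pi_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def b_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # drawn from b
weights = pi_pdf(samples) / b_pdf(samples)  # importance sampling ratios

print(samples.mean())                             # naive estimate, ~0 (wrong)
print((weights * samples).mean())                 # ordinary importance sampling, ~1
print((weights * samples).sum() / weights.sum())  # weighted importance sampling, ~1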
To apply this to RL, let's start again with the prediction problem. First, let's look at the probability of a state-action trajectory occurring when following a policy π:

Pr{A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_{t:T-1} ~ π} = ∏_{k=t}^{T-1} π(A_k | S_k) p(S_{k+1} | S_k, A_k)
Now we introduce our two policies: π is our target policy, the policy we want to optimize, and b is our behavior policy – the policy used to generate data. We need the assumption of coverage: if some action is taken under π, it also needs to have non-zero probability under b – implying that b needs to be stochastic in states where it is not identical to π.
Next we compute the ratio of the probabilities of a trajectory occurring under π and under b:
ρ_{t:T-1} = ∏_{k=t}^{T-1} [π(A_k | S_k) p(S_{k+1} | S_k, A_k)] / [b(A_k | S_k) p(S_{k+1} | S_k, A_k)] = ∏_{k=t}^{T-1} π(A_k | S_k) / b(A_k | S_k)

Conveniently, all properties of the environment itself, such as the transition probabilities, cancel out, and we are left with the ratio of policy probabilities.
We need this for off-policy prediction: we want to estimate the value of policy π from data collected by b – meaning the naive expectation

E[G_t | S_t = s] = v_b(s)
would be off. Luckily, we can fix this via importance sampling:

E[ρ_{t:T-1} G_t | S_t = s] = v_π(s)
Concretely applying this to the RL prediction problem, we get:
V(s) = ( Σ_{t ∈ τ(s)} ρ_{t:T(t)-1} G_t ) / |τ(s)|

This should look very familiar: the predicted value of a state is simply the (importance-sampling corrected) average return – we divide by the number of sampled returns, just like in the typical Monte Carlo estimate.
But for our algorithm we'll actually use weighted importance sampling:
V(s) = ( Σ_{t ∈ τ(s)} ρ_{t:T(t)-1} G_t ) / ( Σ_{t ∈ τ(s)} ρ_{t:T(t)-1} )

That is, it also includes the importance sampling weights in the denominator, i.e. it computes a weighted average (note that the weights in numerator and denominator usually don't cancel out, since they appear inside sums).
Sutton explains the intuition behind this very nicely in my opinion: if we only observed a single return, the weights in the weighted importance sampling estimate would cancel out. The estimate would be biased (it would estimate v_b rather than v_π), but this seems reasonable, since it is all we observed. The ordinary importance sampling estimate would be correct in expectation, but it could be extreme – imagine the importance sampling weight was 10, or 100. In general, weighted importance sampling is biased (although the bias vanishes with more samples), but it has much lower variance than ordinary importance sampling, whose variance can be unbounded (infinite).
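A tiny numeric illustration of this single-return case (with an assumed return and weight):
# A single observed return G with importance sampling weight W
G, W = 2.0, 10.0

print((W * G) / 1)  # ordinary IS: 20.0 - correct in expectation, but extreme
print((W * G) / W)  # weighted IS: 2.0  - just the observed return (estimates v_b)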
So, coming back to off-policy MC prediction – we want to form the estimate:
V_n = ( Σ_{k=1}^{n-1} W_k G_k ) / ( Σ_{k=1}^{n-1} W_k ), for n ≥ 2

We could, of course, compute this estimate from scratch in every step – that is, keep a list of weights and returns, and recompute the corresponding values.
However, Sutton immediately introduces an incremental approach to save memory. In it, we only need to store the previous estimate and use new values to update it, similar to the incremental update we introduced in the post about bandits:
V_{n+1} = V_n + (W_n / C_n) (G_n − V_n)

Further, we also need to keep a running estimate of the cumulative weights so far:

C_{n+1} = C_n + W_{n+1}, with C_0 = 0
To me it is not obvious that these formulas are indeed the incremental version of weighted importance sampling. But trust me, they are correct. I will omit the proof here to keep the post lean, and also because I don't think it adds much understanding – deriving it is Exercise 5.10 of Sutton's book (and you can find solutions online if you search for it). However, I imagine a simpler, non-incremental version of this algorithm could aid understanding – so you will find exactly that at the end of this chapter.
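While I omit the proof, here is a quick numerical sanity check (on assumed toy weights and returns, not a replacement for the derivation) that the incremental update reproduces the batch weighted average:
import numpy as np

rng = np.random.default_rng(42)
weights = rng.uniform(0.1, 5.0, size=50)  # toy importance weights W_k
returns = rng.normal(size=50)             # toy returns G_k

# Batch weighted importance sampling estimate
batch_estimate = np.sum(weights * returns) / np.sum(weights)

# Incremental version: C <- C + W, V <- V + (W / C) * (G - V)
V, C = 0.0, 0.0
for W, G in zip(weights, returns):
    C += W
    V += (W / C) * (G - V)

print(np.isclose(V, batch_estimate))  # True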
With this incremental update scheme, MC off-policy prediction looks as follows:

With this, let's directly go to off-policy MC control:

We see the exact same importance sampling update pattern, only now we also update a policy. Note that the W != 0 check from above is now replaced by a "break" whenever the selected action does not correspond to π (which implies W = 0, since π is deterministic). For the same reason, the update rule uses 1/b(...) instead of π(...)/b(...).
This now translates to Python 1:1:
def off_policy_mc(env: ParametrizedEnv) -> np.ndarray:
    """Solve passed Gymnasium env via off-policy Monte
    Carlo control.

    Args:
        env: env containing the problem

    Returns:
        found policy
    """
    observation_space, action_space = get_observation_action_space(env)
    Q = np.zeros((observation_space.n, action_space.n))
    C = np.zeros((observation_space.n, action_space.n))
    pi = np.argmax(Q, 1)

    num_steps = 1000
    for _ in range(num_steps):
        b = np.random.rand(int(observation_space.n), int(action_space.n))
        b = b / np.expand_dims(np.sum(b, axis=1), -1)
        episode = generate_episode(env, b, False)

        G = 0.0
        W = 1
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = env.gamma * G + r
            C[s, a] += W
            Q[s, a] += W / C[s, a] * (G - Q[s, a])
            pi = np.argmax(Q, 1)
            if a != np.argmax(Q[s]):
                break
            W *= 1 / b[s, a]

    return pi
To conclude this post, as promised, we'll show a "simpler" and more naive version of Sutton's incremental algorithm – i.e. we will implement "pure" importance sampling without the incremental update. I believe this can be helpful for understanding, and it nicely shows how importance sampling really is used, forming all required terms one by one:
def off_policy_mc_non_inc(env: ParametrizedEnv) -> np.ndarray:
    """Solve passed Gymnasium env via off-policy Monte
    Carlo control - but without the incremental algorithm
    from Sutton for updating the importance sampling weights.

    Args:
        env: env containing the problem

    Returns:
        found policy
    """
    observation_space, action_space = get_observation_action_space(env)
    Q = np.zeros((observation_space.n, action_space.n))

    num_steps = 10000
    returns = defaultdict(list)
    ratios = defaultdict(list)

    for _ in range(num_steps):
        b = np.random.rand(int(observation_space.n), int(action_space.n))
        b = b / np.expand_dims(np.sum(b, axis=1), -1)
        episode = generate_episode(env, b, False)

        G = 0.0
        ratio = 1
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = env.gamma * G + r

            # Create current target policy,
            # which is the argmax of the Q function,
            # but gives equal weighting to tied Q values
            pi = np.zeros_like(Q)
            pi[np.arange(Q.shape[0]), np.argmax(Q, 1)] = 1
            uniform_rows = np.all(Q == Q[:, [0]], axis=1)
            pi[uniform_rows] = 1 / action_space.n

            ratio *= pi[s, a] / b[s, a]
            if ratio == 0:
                break

            returns[s, a].append(G)
            ratios[s, a].append(ratio)
            Q[s, a] = sum(
                ret * rho for ret, rho in zip(returns[s, a], ratios[s, a])
            ) / sum(ratios[s, a])

    Q = np.nan_to_num(Q, nan=0.0)
    return np.argmax(Q, 1)
Let's dive a bit deeper here. As in our very first MC control algorithm (MC ES), we want to collect and average returns for all state-action pairs – only here we do every-visit MC instead of first-visit (hence there is no check whether the pair occurs earlier in the episode). Very similarly, we build our Q function by generating episodes and averaging the collected returns – however, here we do not simply average, but instead use (weighted) importance sampling. The quantity we need to compute is:

V(s) = ( Σ_{t ∈ τ(s)} ρ_{t:T(t)-1} G_t ) / ( Σ_{t ∈ τ(s)} ρ_{t:T(t)-1} )
The set τ(s) denotes all timesteps at which state s is visited – for first-visit MC it would only include the timesteps of first visits to s (also, remember we work with state-action pairs – but Sutton introduces importance sampling with states, which I follow here for simplicity). Ignoring the importance sampling weights, this formula would yield exactly MC ES: for every occurrence of a state, we collect its corresponding return G_t and average. In addition, we now need the importance sampling weights to fix our biased estimate. These are:

ρ_{t:T(t)-1} = ∏_{k=t}^{T(t)-1} π(A_k | S_k) / b(A_k | S_k)
Thus, in addition to keeping track of returns for every visited state-action pair, we fill ratios with this quantity. To generate it, while running backwards over an episode we iteratively compute ρ_{T-1:T-1}, ρ_{T-2:T-1} and so on, and store this running product in ratio – which we append to ratios in every step. To do so, we compute on the fly a current target policy π for estimating the correct weights – which is simply the deterministic policy obtained from the current Q, with ties broken evenly.
This approach works (like all the others), but I still found empirically that it needs roughly 10x more episodes for reliable convergence than the incremental off-policy MC control algorithm – which is unexpected, since to my understanding this is simply a non-incremental version with exactly the same functionality. Ultimately I stopped debugging this and hope this additional implementation brings more clarity into how off-policy MC control works. If you have any questions, comments, or suggestions / solutions – in particular about why the results differ – please let me know!
Summary
This brings us to the end of this post. I hope you enjoyed the read and found it informative. If you did, please don't forget to leave a clap or comment, and follow me here to get notified of future posts in this series!
Let's summarize: in this post we followed Chapter 5 of Sutton's book [1], introducing Monte Carlo methods for RL. MC methods learn from experience alone, without requiring a model – which is a stunning and exciting opportunity. As in the previous posts, we started with the prediction problem – explaining how to predict the value of a policy using an MC approach.
Next, we moved on to control – actually finding a good (optimal) policy. We started with an MC control algorithm using the exploring starts (ES) assumption – that every state-action pair has a non-zero probability of starting an episode, and is thus visited indefinitely often. After this, we removed the ES assumption – first presenting an on-policy algorithm which does not output a deterministic policy, but an ε-greedy, near-optimal one. Then we opened the powerful toolbox of off-policy algorithms: keeping a target and a behavior policy, and using importance sampling to correct the biased expectations obtained by sampling from the (wrong) policy. Sutton presents an elegant incremental algorithm, which we covered, too. Finally, we also presented a non-incremental off-policy MC control algorithm – hopefully aiding understanding and showing how to implement off-policy MC control in a simple way, without any optimization tricks.
As always, you can find all the code on GitHub. Thanks for reading!
Other Posts in this Series
- Part 1: Introduction to Reinforcement Learning and Solving the Multi-armed Bandit Problem
- Part 2: Introducing Markov Decision Processes, Setting up Gymnasium Environments and Solving them via Dynamic Programming Methods
References
[1] Sutton, R. S., & Barto, A. G.: Reinforcement Learning: An Introduction, 2nd ed., MIT Press.
[2] Gymnasium, Farama Foundation.