Policy Gradient Methods in Reinforcement Learning

Imagine you are trying to teach a dog to fetch a ball. At first, the dog has no idea what to do when you throw the ball. It might run in different directions, ignore the ball, or do something completely unrelated. Your goal is to teach the dog the best way to fetch the ball and bring it back to you.
Each time the dog does something, you either reward it with a treat or do nothing. If the dog runs toward the ball, you give it a treat. If it does something else, you don't. This set of guidelines or strategies that the dog follows to decide what to do is called a policy. Initially, these guidelines are random, but with training, they become more focused on fetching the ball.
Over time, the dog learns that running towards the ball results in treats. It starts to follow this strategy more often because it leads to rewards. This is essentially how Policy Gradient Methods work. In this article, we will explore their mechanisms and math, and we will use the OpenAI Gym to train a car to drive across mountains. Let's get started!
Index
1: Introduction to Policy Gradient Optimization
∘ 1.1: What is Policy Gradient?
2: The Anatomy of Policy Gradient Methods
∘ 2.1: Policy Representations
∘ 2.2: Objective Functions
3: Key Components of Policy Gradient Optimization
∘ 3.1: The Policy Network
∘ 3.2: The Reward Function
∘ 3.3: Gradient Estimation
4: Implementing a Policy Gradient Method from Scratch
∘ 4.1: Setting Up the Environment and Hyperparameters
∘ 4.2: Defining the Policy Network
∘ 4.3: Implementing the REINFORCE Algorithm
∘ 4.4: Optimizing Hyperparameters with Optuna
∘ 4.5: Running the Training Process
5: Conclusion
References
1: Introduction to Policy Gradient Optimization
In reinforcement learning, the agent (like the dog) learns to perform actions to achieve the best outcomes (like fetching the ball for treats). Policy gradient optimization is about adjusting the guidelines (policy) the agent follows so that it can maximize its rewards over time.
1.1: What is Policy Gradient?
Unlike value-based methods like Q-learning, which estimate the value of actions in specific states, policy gradient methods adjust the policy's parameters to maximize the expected cumulative reward.
In reinforcement learning, a policy defines an agent's behavior by specifying a probability distribution over actions given a state. Mathematically, a policy π is represented as π(a|s; θ), where "a" is the action, "s" is the state, and "θ" are the policy parameters.
The goal of policy gradient methods is to find the optimal policy parameters (θ*) that maximize the expected return J(θ), which is the expected cumulative reward the agent receives from interacting with the environment. This objective can be expressed as:
J(θ) = E_τ∼π_θ [ Σ_t γ^t r_t ]
where "γ" is the discount factor, and "r_t" is the reward at time step "t".
To achieve this, policy gradient methods use the gradient ascent algorithm. The parameters "θ" are updated in the direction of the gradient of the expected return with respect to the policy parameters. The update rule can be written as:
θ ← θ + α ∇θ J(θ)
where "α" is the learning rate, and ∇θ J(θ) is the gradient of the expected return with respect to the policy parameters.
The gradient ∇θ J(θ) is estimated using samples from the environment. One popular method for this is the REINFORCE algorithm, which uses the following gradient estimate:
∇θ J(θ) ≈ Σ_t ∇θ log π(a_t | s_t; θ) · R
where "R" is the cumulative reward obtained from the sampled trajectory. This estimate leverages the log-likelihood trick to compute the gradient efficiently.
2: The Anatomy of Policy Gradient Methods
2.1: Policy Representations
The policy is the core component that determines the agent's behavior by mapping states to actions. The way this mapping is structured is called policy representation. There are several ways to represent policies, each with its advantages and trade-offs. Let's explore some common policy representations used in policy gradient methods.
2.1.1: Parametric Policies
Parametric policies use a set of parameters, like the weights and biases of a neural network, to define the policy. Think of it like baking a cake: the recipe (policy) tells you the quantities (parameters) of each ingredient (weights and biases) you need. The policy is often modeled as a probability distribution over actions given a state, and the neural network helps determine this distribution.
Discrete Actions
For environments with discrete action spaces, the policy can be represented as a categorical distribution. Imagine you're at a restaurant choosing a meal from a menu. The neural network outputs a probability for each possible dish, and you select one based on these probabilities. The softmax function converts the output into a probability distribution over actions, similar to how you weigh your options and pick one.
π(a|s; θ) = exp(f_a(s; θ)) / Σ_a′ exp(f_a′(s; θ))
Here, f(s;θ) is the output of the neural network, and the softmax function converts the output into a probability distribution over actions.
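Below is a minimal sketch of such a categorical policy, assuming PyTorch; the class name, layer sizes, and the 4-dimensional state / 2-action setup are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete-action policy: softmax over the network outputs f(s; theta)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)  # f(s; theta)
        # Categorical applies the softmax internally when given logits
        return torch.distributions.Categorical(logits=logits)

# Usage: sample an action and keep its log-probability for the gradient estimate
policy = CategoricalPolicy(state_dim=4, n_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()
log_prob = dist.log_prob(action)
```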
Continuous Actions
For environments with continuous action spaces, the policy is often represented as a Gaussian distribution. Picture adjusting the volume on a stereo: the neural network outputs the mean (desired volume) and standard deviation (how much you might adjust it up or down). The agent samples an action from this distribution, like finding the right volume level by slightly adjusting the knob around the desired setting.
π(a|s; θ) = N(a; μ(s; θ), σ(s; θ)²)
Here, μ(s;θ) and σ(s;θ) are the mean and standard deviation produced by the neural network for a given state s.
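A comparable sketch for a Gaussian policy, again assuming PyTorch; the architecture and the clamp range on the log standard deviation are illustrative choices.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Continuous-action policy: outputs mu(s; theta) and sigma(s; theta)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, action_dim)     # mu(s; theta)
        self.log_std_head = nn.Linear(hidden, action_dim)  # log sigma(s; theta)

    def forward(self, state):
        h = self.body(state)
        mu = self.mean_head(h)
        sigma = self.log_std_head(h).clamp(-5, 2).exp()    # keep sigma positive and bounded
        return torch.distributions.Normal(mu, sigma)

# Usage: sample a continuous action and its log-probability
dist = GaussianPolicy(state_dim=2, action_dim=1)(torch.randn(2))
action = dist.sample()
log_prob = dist.log_prob(action).sum()
```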
2.1.2: Non-Parametric Policies
Non-parametric policies don't rely on a fixed set of parameters to define the policy. Instead, they use techniques like nearest neighbor or kernel-based methods. This is like having a flexible recipe book where you choose the best recipe based on what's in your pantry. These methods are less common in deep reinforcement learning but can be useful when flexibility and adaptability are needed.
2.1.3: Deterministic Policies
While policy gradient methods are typically associated with stochastic policies, deterministic policies can also be used, especially in continuous action spaces. Deterministic policies directly map states to actions without introducing randomness in the action selection process.
The Deterministic Policy Gradient (DPG) algorithm optimizes deterministic policies by adjusting the policy parameters to maximize the expected return. The policy is represented as a deterministic function μ(s;θ), and the gradient of the expected return is computed with respect to the policy parameters.
∇θ J(θ) = E_s [ ∇θ μ(s; θ) ∇_a Q(s, a)|_{a = μ(s; θ)} ]
where Q(s, a) is the action-value function.
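The chain rule in this expression can be realized by backpropagating through a critic. The sketch below shows one such actor update, assuming PyTorch; `actor`, `critic`, `states`, and `actor_optimizer` are hypothetical objects defined elsewhere, and this covers only the actor half of a DPG-style method.

```python
import torch

def dpg_actor_update(actor, critic, states, actor_optimizer):
    """One deterministic-policy actor update.

    actor(states)           -> a = mu(s; theta), deterministic actions
    critic(states, actions) -> Q(s, a), the learned action-value estimate
    """
    actions = actor(states)
    # Backpropagating through Q(s, mu(s)) applies grad_theta mu(s) * grad_a Q(s, a)
    actor_loss = -critic(states, actions).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```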
2.1.4: Hybrid Policies
Hybrid policies combine elements of both parametric and non-parametric approaches. For example, a policy might use a neural network to parameterize a mixture of Gaussians, where the network outputs the parameters of multiple Gaussian distributions, and the final policy is a weighted sum of these distributions. Hybrid policies can provide additional flexibility and expressiveness in modeling complex behaviors.
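As one possible realization, a network could output the weights, means, and standard deviations of several Gaussian components; the sketch below builds such a mixture with PyTorch's distribution utilities, using an arbitrary architecture and a 1-dimensional action for simplicity.

```python
import torch
import torch.nn as nn

class MixturePolicy(nn.Module):
    """Mixture-of-Gaussians policy over a 1-D continuous action."""
    def __init__(self, state_dim, n_components=3, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.weight_head = nn.Linear(hidden, n_components)   # mixture logits
        self.mean_head = nn.Linear(hidden, n_components)     # component means
        self.log_std_head = nn.Linear(hidden, n_components)  # component log stds

    def forward(self, state):
        h = self.body(state)
        mix = torch.distributions.Categorical(logits=self.weight_head(h))
        comp = torch.distributions.Normal(self.mean_head(h), self.log_std_head(h).exp())
        # Weighted combination of the Gaussian components
        return torch.distributions.MixtureSameFamily(mix, comp)

# Usage: sample an action from the mixture and get its log-probability
dist = MixturePolicy(state_dim=3)(torch.randn(3))
action = dist.sample()
log_prob = dist.log_prob(action)
```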
2.2: Objective Functions
The objective function in policy gradient methods defines what the agent is trying to optimize. It measures how well a policy performs and guides the learning process by setting the criteria for updating policy parameters. In these methods, the goal is to maximize the expected cumulative reward. Let's explore different types of objective functions.
Expected Return
The most basic objective function in policy gradient methods is the expected return. This represents the total expected reward an agent gets when following a policy, denoted as