
How to Derive the Policy Gradient with Monte Carlo Sampling?

February 28, 2026
8 min read

A Review of the Objective

Let $\theta$ represent the parameters of our neural network, and let $\tau$ represent a specific trajectory sampled from the probability distribution over all possible trajectories, denoted $p_\theta(\tau)$. Our goal is to find the optimal parameter $\theta^\star$ that maximizes our expected reward.

We define our objective function, $J(\theta)$, as the expected reward over these trajectories:

$$J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)].$$

To maximize this expected value, we need to find the gradient of $J(\theta)$ with respect to $\theta$. We will show that this gradient can be expressed in a form that allows for estimation via Monte Carlo sampling.

Remark

A brief note on the notation $\tau \sim p_\theta(\tau)$:

  • The term $p_\theta(\tau)$ defines the probability distribution over all possible trajectories, parameterized by our network weights $\theta$.
  • The variable $\tau$ refers to one specific, concrete trajectory.
  • The symbol $\sim$ means “is sampled from”.

In short, $\tau \sim p_\theta(\tau)$ means “a specific trajectory $\tau$ is sampled from the distribution over all possible trajectories defined by $p_\theta$.”

Calculating the Gradient of $J(\theta)$

Because expected values represent a statistical average over a continuous space of trajectories, we can express $J(\theta)$ formally using an integral:

$$J(\theta) = \int p_\theta(\tau)\, r(\tau)\, d\tau.$$

We compute the gradient of $J(\theta)$ with respect to $\theta$ by taking the derivative of the integral. Assuming we can pass the gradient operator inside the integral, we see that:

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \left( \int p_\theta(\tau)\, r(\tau)\, d\tau \right) \\ &= \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau \end{aligned}$$

Remark

Note that $\nabla_\theta$ is an operator, not a variable. The expression $\nabla_\theta J(\theta)$ should be interpreted as applying the gradient operator to $J(\theta)$, not as multiplying $\nabla_\theta$ by $J(\theta)$.

To evaluate this integral numerically, one might initially consider a Riemann sum. In a standard Riemann sum for an integral $\displaystyle \int f(x)\, dx$, we evaluate the function at specific points and multiply by a small width $\Delta x$.

[Figure: Riemann sum approximation of a definite integral]

Applied to our trajectory space, we would slice the space of $\tau$ into small chunks $\Delta\tau$, yielding:

$$\lim_{n\rightarrow \infty} \sum_{i=1}^{n} \nabla_\theta p_\theta(\tau_i)\, r(\tau_i)\, \Delta\tau$$

However, the space of possible trajectories is infinitely vast (encompassing all possible joint positions, velocities, and timestamps). Therefore, evaluating a Riemann sum over this space is computationally infeasible.
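To see why dimensionality is the problem, here is a minimal midpoint Riemann sum for the one-dimensional case (the helper name `riemann` is ours, for illustration). One dimension needs only $n$ evaluations, but a grid with $n$ points per dimension over a $d$-dimensional trajectory space needs $n^d$ evaluations, which explodes immediately:

```python
def riemann(f, a, b, n):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# One dimension is easy: approximate the integral of x^2 over [0, 1].
print(riemann(lambda x: x * x, 0.0, 1.0, 1000))  # ≈ 1/3

# But with, say, 1000 grid points per dimension, a d-dimensional grid
# costs 1000**d evaluations — hopeless for high-dimensional trajectories.
```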

Monte Carlo Sampling

Because we cannot evaluate the integral analytically or via a Riemann sum, we turn to Monte Carlo sampling. To use Monte Carlo estimation, our integral must be structured as an expected value:

$$\int \textbf{[Valid Probability Distribution]} \times \text{[Some Value]} \, dx$$

To illustrate, suppose we want to find the expected value of the function $f(x) = x^2$, where $x$ is uniformly distributed between 0 and 1. Here, our valid probability distribution is the probability density function $p(x) = 1$ for $x \in [0, 1]$, illustrated below.

[Figure: the uniform density $p(x) = 1$ on $[0, 1]$]

Why is $p(x)$ exactly the constant 1?

This comes back to how a probability density function is defined. A density must satisfy $\displaystyle \int_{-\infty}^{\infty} p(x)\, dx = 1$, and $\displaystyle P(a \leq x \leq b) = \int_a^b p(x)\, dx$ is a special case of integrating the density. In our scenario, setting $p(x) = 1$ on $0 \leq x \leq 1$ makes the total area under the curve equal 1, satisfying the 100% probability rule: $\displaystyle \int_{0}^{1} p(x)\, dx = 1$.

Formally, we write this expected value as:

$$E_{x \sim p(x)}[x^2] = \int_{0}^{1} p(x) \cdot x^2 \, dx.$$

Because $p(x) = 1$, the exact analytical solution is:

$$\int_{0}^{1} 1 \cdot x^2 \, dx = \left[ \frac{x^3}{3} \right]_{0}^{1} = \frac{1}{3} \approx 0.333.$$

If the function were too complex to integrate analytically, Monte Carlo sampling allows us to approximate the expected value by simulating it. We sample $N$ random values from our distribution and average the results:

$$E_{x \sim p(x)}[x^2] \approx \frac{1}{N} \sum_{i=1}^{N} x_i^2.$$

Here is a brief Python script demonstrating this approximation:

import random
from collections.abc import Callable
 
def square(x: float) -> float:
    return x**2
 
def uniform_sample(lower: float, upper: float) -> float:
    return random.uniform(lower, upper)
 
def monte_carlo_sampling(
    sampler: Callable[[float, float], float],
    value: Callable[[float], float],
    sample_count: int,
) -> float:
    """Estimate E[value(x)] by averaging over samples drawn from `sampler`."""
    total = 0.0
    for _ in range(sample_count):
        x = sampler(0.0, 1.0)  # draw x ~ Uniform(0, 1)
        total += value(x)
    return total / sample_count
 
print(monte_carlo_sampling(uniform_sample, square, 1_000_000))

Running this code yields a result very close to $0.333$, demonstrating the power of the Monte Carlo method.

The Log Derivative Trick

Returning to our policy gradient, we encounter a problem. Our current gradient expression is:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$

This does not fit the Monte Carlo template because $\textcolor{red}{\nabla_\theta p_\theta(\tau)}$ is a gradient, not a valid probability distribution. To resolve this, we use a calculus identity known as the log-derivative trick.

Recall that by the chain rule, the derivative of the natural logarithm of a function $f(x)$ is:

$$\frac{d}{dx} \log(f(x)) = \frac{f'(x)}{f(x)}.$$

If we substitute $\theta$ for $x$ and $p_\theta(\tau)$ for $f(x)$, we obtain:

$$\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}.$$

Multiplying both sides by $p_\theta(\tau)$ yields our identity:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau).$$
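The identity is easy to sanity-check numerically. Here is a sketch using a one-dimensional Gaussian density with mean $\mu$ standing in for the parameter (the names `gauss_pdf`, `lhs`, and `rhs` are ours): the finite-difference derivative of $p$ should match $p \cdot \nabla \log p$, where $\nabla_\mu \log p(x;\mu) = x - \mu$ for a unit-variance Gaussian.

```python
import math

def gauss_pdf(x: float, mu: float) -> float:
    """Density of N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

x, mu, eps = 0.7, 0.2, 1e-6

# Left side: central finite-difference approximation of d p / d mu.
lhs = (gauss_pdf(x, mu + eps) - gauss_pdf(x, mu - eps)) / (2 * eps)

# Right side: p(x; mu) * d log p(x; mu) / d mu, which equals p * (x - mu).
rhs = gauss_pdf(x, mu) * (x - mu)

print(abs(lhs - rhs))  # prints a very small number (finite-difference error)
```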

We can now substitute this identity back into our gradient integral:

$$\begin{aligned} \nabla_\theta J(\theta) &= \int \underbrace{\textcolor{red}{\nabla_\theta p_\theta(\tau)}}_{\text{not valid}} \,\, \underbrace{r(\tau)}_{\text{value}} \, d\tau \\ &= \int \underbrace{p_\theta(\tau)}_{\text{valid probability}} \underbrace{\nabla_\theta \log p_\theta(\tau) \,\, r(\tau)}_{\text{value}} \,\, d\tau. \end{aligned}$$

Because the integral is now the product of a valid probability distribution ($p_\theta(\tau)$) and a specific value ($\nabla_\theta \log p_\theta(\tau)\, r(\tau)$), we can rewrite it as an expected value:

$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \right].$$
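Before moving to trajectories, we can verify this expectation form on a toy distribution whose gradient we know analytically. Assuming $x \sim \mathcal{N}(\theta, 1)$ with $r(x) = x^2$ (our choice for illustration), we have $E[r] = \theta^2 + 1$, so the true gradient is $2\theta$; the score function is $\nabla_\theta \log p_\theta(x) = x - \theta$. A Monte Carlo sketch:

```python
import random

random.seed(0)

theta = 0.5
N = 1_000_000

# Estimate grad_theta E[x^2] for x ~ N(theta, 1) by averaging
# score * reward = (x - theta) * x^2 over samples.
grad_est = 0.0
for _ in range(N):
    x = random.gauss(theta, 1.0)
    grad_est += (x - theta) * x**2
grad_est /= N

print(grad_est)  # ≈ 2 * theta = 1.0
```

No derivative of the sampling process is needed: the gradient information enters only through the score term, which is exactly what makes the estimator practical.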

Expanding the Probability of a Trajectory

Next, we must expand the $\log p_\theta(\tau)$ term. The probability of a specific trajectory $\tau$ occurring is the product of the initial state probability and the probabilities of all subsequent actions and state transitions:

$$p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^T \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)\, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t).$$

Since the product rule of logarithms gives

$$\log_{b}(xy) = \log_{b} x + \log_{b} y,$$

taking the natural logarithm of both sides transforms the products into sums:

$$\begin{aligned} \log \left(p_\theta(\tau)\right) &= \log \left( p(\mathbf{s}_1) \prod_{t=1}^T \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)\, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \right) \\ &= \log p(\mathbf{s}_1) + \sum_{t=1}^T \Big( \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \Big) \\ &= \log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t). \end{aligned}$$

We then apply the gradient with respect to $\theta$ to this entire expression:

$$\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \left( \log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \right).$$

Crucially, the terms $p(\mathbf{s}_1)$ and $p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$ represent the dynamics of the environment. Because they do not depend on our network parameters $\theta$, their gradients with respect to $\theta$ are zero. This allows us to cancel them out:

$$\begin{aligned} \nabla_\theta \log \left(p_\theta(\tau)\right) &= \nabla_\theta \left(\log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\right) \\ &= \nabla_\theta \left(\cancel{\log p(\mathbf{s}_1)} + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \cancel{\sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)}\right) \\ &= \nabla_\theta \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t), \end{aligned}$$

leaving

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t).$$

Why do we need to expand $\log_{b}(xy)$ into $\log_{b} x + \log_{b} y$?

Proof

Assume, for the sake of contradiction, that it is viable to evaluate the objective without expanding the $\log$, meaning we directly compute the $\log$ of the trajectory probability $\displaystyle p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^T \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)\, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$.

Let $T$ be a large number of timesteps, such as $T = 1000$, and let the policy be confident, assigning a probability of $0.9$ to each action. The trajectory probability will then contain the product $0.9^{1000} \approx 1.7 \times 10^{-46}$. In a standard 32-bit floating point representation, the smallest representable (subnormal) value is roughly $1.4 \times 10^{-45}$, so this product underflows to $0.0$.

We have reached a contradiction, so our assumption was wrong. Therefore, we must expand the $\log$. By doing so, the product transforms into a sum of logarithms: $\displaystyle \log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$. This evaluates to a sum of manageable negative values, safely bypassing numerical underflow. ▪️

The Final Gradient of $J(\theta)$

Finally, we substitute our expanded $\nabla_\theta \log p_\theta(\tau)$ expression and the expanded reward $r(\tau)$ back into our expected value equation.

$$\begin{align} \nabla_\theta J(\theta) &= E_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \right] \tag{1} \\ \nabla_\theta \log p_\theta(\tau) &= \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \tag{2} \\ r(\tau) &= \sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \tag{3} \end{align}$$

This gives us the final, computable form of the policy gradient:

$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \right) \left( \sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \right) \right].$$
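To close the loop, this final expression can itself be estimated with the same Monte Carlo recipe from earlier: sample trajectories from the policy and average. A minimal sketch on a one-step "bandit" of our own invention ($T = 1$, a Bernoulli policy $\pi_\theta(a{=}1) = \sigma(\theta)$, reward $r(a) = a$), where $J(\theta) = \sigma(\theta)$ and the analytic gradient is $\sigma(\theta)(1 - \sigma(\theta))$:

```python
import math
import random

random.seed(0)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

theta = 0.3
p1 = sigmoid(theta)  # pi_theta(a = 1)
N = 200_000

# Monte Carlo estimate of grad_theta J = E[ grad_theta log pi_theta(a) * r(a) ].
grad_est = 0.0
for _ in range(N):
    a = 1 if random.random() < p1 else 0  # sample an action from pi_theta
    score = a - p1  # grad_theta log pi_theta(a) for this Bernoulli-sigmoid policy
    grad_est += score * float(a)          # score * reward
grad_est /= N

print(grad_est)       # Monte Carlo estimate of the gradient
print(p1 * (1 - p1))  # analytic gradient for comparison
```

The two printed values agree closely, and the same structure — sample, score, multiply by reward, average — is exactly what REINFORCE does over full multi-step trajectories.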
