
How to Derive the Policy Gradient with Monte Carlo Sampling?

February 28, 2026
8 min read

A Review of the Objective

Let $\theta$ represent the parameters of our neural network, and let $\tau$ represent a specific trajectory sampled from the probability distribution over all possible trajectories, denoted $p_\theta(\tau)$. Our goal is to find the optimal parameter $\theta^\star$ that maximizes our expected reward.

We define our objective function, $J(\theta)$, as the expected reward over these trajectories:

$$J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)].$$

To maximize this expected value, we need to find the gradient of $J(\theta)$ with respect to $\theta$. We will show that this gradient can be expressed in a form that allows for estimation via Monte Carlo sampling.

Remark

A brief note on the notation $\tau \sim p_\theta(\tau)$:

  • The term $p_\theta(\tau)$ defines the probability distribution over all possible trajectories, parameterized by our network weights $\theta$.
  • The variable $\tau$ refers to one specific, concrete trajectory.
  • The symbol $\sim$ means “is sampled from”.

In short, $\tau \sim p_\theta(\tau)$ means “a specific trajectory $\tau$ is sampled from the distribution over all possible trajectories defined by $p_\theta$.”

Calculating the Gradient of $J(\theta)$

Because expected values represent a statistical average over a continuous space of trajectories, we can express $J(\theta)$ formally using an integral:

$$J(\theta) = \int p_\theta(\tau)\, r(\tau)\, d\tau.$$

We compute the gradient of $J(\theta)$ with respect to $\theta$ by taking the derivative of the integral. Assuming we can pass the gradient operator inside the integral, we see that:

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \left( \int p_\theta(\tau)\, r(\tau)\, d\tau \right) \\ &= \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau \end{aligned}$$

Remark

Note that $\nabla_\theta$ is an operator, not a variable. The expression $\nabla_\theta J(\theta)$ should be interpreted as applying the gradient operator to $J(\theta)$, not as multiplying $\nabla_\theta$ by $J(\theta)$.

To evaluate this integral numerically, one might initially consider a Riemann sum. In a standard Riemann sum for an integral $\displaystyle \int f(x)\, dx$, we evaluate the function at specific points and multiply by a small width $\Delta x$.

[Figure: Riemann sum approximation of a definite integral]

Applied to our trajectory space, we would slice the space of $\tau$ into small chunks $\Delta\tau$, yielding:

$$\lim_{n\rightarrow \infty} \sum_{i=1}^{n} \nabla_\theta p_\theta(\tau_i)\, r(\tau_i)\, \Delta\tau$$

However, the space of possible trajectories is infinitely vast (encompassing all possible joint positions, velocities, and timestamps). Therefore, evaluating a Riemann sum over this space is computationally infeasible.
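To see why dimensionality is the problem, here is a minimal midpoint Riemann sum for the one-dimensional case (the helper name `riemann` is ours, for illustration). One dimension needs only $n$ evaluations, but a grid with $n$ points per dimension over a $d$-dimensional trajectory space needs $n^d$ evaluations, which explodes immediately:

```python
def riemann(f, a, b, n):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# One dimension is easy: approximate the integral of x^2 over [0, 1].
print(riemann(lambda x: x * x, 0.0, 1.0, 1000))  # ≈ 1/3

# But with, say, 1000 grid points per dimension, a d-dimensional grid
# costs 1000**d evaluations — hopeless for high-dimensional trajectories.
```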

Monte Carlo Sampling

Because we cannot evaluate the integral analytically or via a Riemann sum, we turn to Monte Carlo sampling. To use Monte Carlo estimation, our integral must be structured as an expected value:

$$\int \textbf{[Valid Probability Distribution]} \times \text{[Some Value]} \, dx$$

To illustrate, suppose we want to find the expected value of the function $f(x) = x^2$, where $x$ is uniformly distributed between 0 and 1. Here, our valid probability distribution is the probability density function $p(x) = 1$ for $x \in [0, 1]$, illustrated below.

[Figure: the uniform density $p(x) = 1$ on $[0, 1]$]

Why is $p(x)$ exactly the constant 1?

This comes back to how a probability density function is defined. A density must satisfy $\displaystyle \int_{-\infty}^{\infty} p(x)\, dx = 1$, and $\displaystyle P(a \leq x \leq b) = \int_a^b p(x)\, dx$ is a special case of integrating the density. In our scenario, setting $p(x) = 1$ on $0 \leq x \leq 1$ makes the total area under the curve equal 1, satisfying the 100% probability rule: $\displaystyle \int_{0}^{1} p(x)\, dx = 1$.

Formally, we write this expected value as:

$$E_{x \sim p(x)}[x^2] = \int_{0}^{1} p(x) \cdot x^2 \, dx.$$

Because $p(x) = 1$, the exact analytical solution is:

$$\int_{0}^{1} 1 \cdot x^2 \, dx = \left[ \frac{x^3}{3} \right]_{0}^{1} = \frac{1}{3} \approx 0.333.$$

If the function were too complex to integrate analytically, Monte Carlo sampling allows us to approximate the expected value by simulating it. We sample $N$ random values from our distribution and average the results:

$$E_{x \sim p(x)}[x^2] \approx \frac{1}{N} \sum_{i=1}^{N} x_i^2.$$

Here is a brief Python script demonstrating this approximation:

import random
from collections.abc import Callable
 
def square(x: float) -> float:
    return x**2
 
def uniform_sample(lower: float, upper: float) -> float:
    return random.uniform(lower, upper)
 
def monte_carlo_sampling(
    sampler: Callable[[float, float], float],
    value: Callable[[float], float],
    sample_count: int,
) -> float:
    """Estimate E[value(x)] by averaging over samples drawn from `sampler`."""
    total = 0.0
    for _ in range(sample_count):
        x = sampler(0.0, 1.0)  # draw x ~ Uniform(0, 1)
        total += value(x)
    return total / sample_count
 
print(monte_carlo_sampling(uniform_sample, square, 1_000_000))

Running this code yields a result very close to $0.333$, demonstrating the power of the Monte Carlo method.

The Log Derivative Trick

Returning to our policy gradient, we encounter a problem. Our current gradient expression is:

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$

This does not fit the Monte Carlo template because $\textcolor{red}{\nabla_\theta p_\theta(\tau)}$ is a gradient, not a valid probability distribution. To resolve this, we use a calculus identity known as the log-derivative trick.

Recall that by the chain rule, the derivative of the natural logarithm of a function $f(x)$ is:

$$\frac{d}{dx} \log(f(x)) = \frac{f'(x)}{f(x)}.$$

If we substitute $\theta$ for $x$ and $p_\theta(\tau)$ for $f(x)$, we obtain:

$$\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}.$$

Multiplying both sides by $p_\theta(\tau)$ yields our identity:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau).$$
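The identity is easy to sanity-check numerically. Here is a sketch using a one-dimensional Gaussian density with mean $\mu$ standing in for the parameter (the names `gauss_pdf`, `lhs`, and `rhs` are ours): the finite-difference derivative of $p$ should match $p \cdot \nabla \log p$, where $\nabla_\mu \log p(x;\mu) = x - \mu$ for a unit-variance Gaussian.

```python
import math

def gauss_pdf(x: float, mu: float) -> float:
    """Density of N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

x, mu, eps = 0.7, 0.2, 1e-6

# Left side: central finite-difference approximation of d p / d mu.
lhs = (gauss_pdf(x, mu + eps) - gauss_pdf(x, mu - eps)) / (2 * eps)

# Right side: p(x; mu) * d log p(x; mu) / d mu, which equals p * (x - mu).
rhs = gauss_pdf(x, mu) * (x - mu)

print(abs(lhs - rhs))  # prints a very small number (finite-difference error)
```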

We can now substitute this identity back into our gradient integral:

$$\begin{aligned} \nabla_\theta J(\theta) &= \int \underbrace{\textcolor{red}{\nabla_\theta p_\theta(\tau)}}_{\text{not valid}} \,\, \underbrace{r(\tau)}_{\text{value}} \, d\tau \\ &= \int \underbrace{p_\theta(\tau)}_{\text{valid probability}} \underbrace{\nabla_\theta \log p_\theta(\tau) \,\, r(\tau)}_{\text{value}} \,\, d\tau. \end{aligned}$$

Because the integral is now the product of a valid probability distribution ($p_\theta(\tau)$) and a specific value ($\nabla_\theta \log p_\theta(\tau)\, r(\tau)$), we can rewrite it as an expected value:

$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \right].$$
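Before moving to trajectories, we can verify this expectation form on a toy distribution whose gradient we know analytically. Assuming $x \sim \mathcal{N}(\theta, 1)$ with $r(x) = x^2$ (our choice for illustration), we have $E[r] = \theta^2 + 1$, so the true gradient is $2\theta$; the score function is $\nabla_\theta \log p_\theta(x) = x - \theta$. A Monte Carlo sketch:

```python
import random

random.seed(0)

theta = 0.5
N = 1_000_000

# Estimate grad_theta E[x^2] for x ~ N(theta, 1) by averaging
# score * reward = (x - theta) * x^2 over samples.
grad_est = 0.0
for _ in range(N):
    x = random.gauss(theta, 1.0)
    grad_est += (x - theta) * x**2
grad_est /= N

print(grad_est)  # ≈ 2 * theta = 1.0
```

No derivative of the sampling process is needed: the gradient information enters only through the score term, which is exactly what makes the estimator practical.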

Expanding the Probability of a Trajectory

Next, we must expand the $\log p_\theta(\tau)$ term. The probability of a specific trajectory $\tau$ occurring is the product of the initial state probability and the probabilities of all subsequent actions and state transitions:

$$p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^T \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)\, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t).$$

Since the product rule of logarithms gives

$$\log_{b}(xy) = \log_{b} x + \log_{b} y,$$

taking the natural logarithm of both sides transforms the products into sums:

$$\begin{aligned} \log \left(p_\theta(\tau)\right) &= \log \left( p(\mathbf{s}_1) \prod_{t=1}^T \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)\, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \right) \\ &= \log p(\mathbf{s}_1) + \sum_{t=1}^T \Big( \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \Big) \\ &= \log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t). \end{aligned}$$

We then apply the gradient with respect to $\theta$ to this entire expression:

$$\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \left( \log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t) \right).$$

Crucially, the terms $p(\mathbf{s}_1)$ and $p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$ represent the dynamics of the environment. Because they do not depend on our network parameters $\theta$, their gradients with respect to $\theta$ are zero. This allows us to cancel them out:

$$\begin{aligned} \nabla_\theta \log \left(p_\theta(\tau)\right) &= \nabla_\theta \left(\log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)\right) \\ &= \nabla_\theta \left(\cancel{\log p(\mathbf{s}_1)} + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \cancel{\sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)}\right) \\ &= \nabla_\theta \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t), \end{aligned}$$

leaving

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t).$$

Why do we need to expand $\log_{b}(xy)$ into $\log_{b} x + \log_{b} y$?

Proof

Assume, for the sake of contradiction, that it is viable to evaluate the objective without expanding the $\log$, meaning we directly compute the $\log$ of the trajectory probability $\displaystyle p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^T \pi_\theta(\mathbf{a}_t | \mathbf{s}_t)\, p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$.

Let $T$ be a large number of timesteps, such as $T = 1000$, and let the policy be confident, assigning a probability of $0.9$ to each action. The trajectory probability will then contain the product $0.9^{1000} \approx 1.7 \times 10^{-46}$. In a standard 32-bit floating point representation, the smallest representable (subnormal) value is roughly $1.4 \times 10^{-45}$, so this product underflows to $0.0$.

We have reached a contradiction, so our assumption was wrong. Therefore, we must expand the $\log$. By doing so, the product transforms into a sum of logarithms: $\displaystyle \log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^T \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) + \sum_{t=1}^T \log p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$. This evaluates to a sum of manageable negative values, safely bypassing numerical underflow. ▪️

The Final Gradient of $J(\theta)$

Finally, we substitute our expanded $\nabla_\theta \log p_\theta(\tau)$ expression and the expanded reward $r(\tau)$ back into our expected value equation.

$$\begin{align} \nabla_\theta J(\theta) &= E_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \right] \tag{1} \\ \nabla_\theta \log p_\theta(\tau) &= \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \tag{2} \\ r(\tau) &= \sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \tag{3} \end{align}$$

This gives us the final, computable form of the policy gradient:

$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^T \nabla_\theta \log \pi_\theta(\mathbf{a}_t | \mathbf{s}_t) \right) \left( \sum_{t=1}^T r(\mathbf{s}_t, \mathbf{a}_t) \right) \right].$$
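To close the loop, this final expression can itself be estimated with the same Monte Carlo recipe from earlier: sample trajectories from the policy and average. A minimal sketch on a one-step "bandit" of our own invention ($T = 1$, a Bernoulli policy $\pi_\theta(a{=}1) = \sigma(\theta)$, reward $r(a) = a$), where $J(\theta) = \sigma(\theta)$ and the analytic gradient is $\sigma(\theta)(1 - \sigma(\theta))$:

```python
import math
import random

random.seed(0)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

theta = 0.3
p1 = sigmoid(theta)  # pi_theta(a = 1)
N = 200_000

# Monte Carlo estimate of grad_theta J = E[ grad_theta log pi_theta(a) * r(a) ].
grad_est = 0.0
for _ in range(N):
    a = 1 if random.random() < p1 else 0  # sample an action from pi_theta
    score = a - p1  # grad_theta log pi_theta(a) for this Bernoulli-sigmoid policy
    grad_est += score * float(a)          # score * reward
grad_est /= N

print(grad_est)       # Monte Carlo estimate of the gradient
print(p1 * (1 - p1))  # analytic gradient for comparison
```

The two printed values agree closely, and the same structure — sample, score, multiply by reward, average — is exactly what REINFORCE does over full multi-step trajectories.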
