Logo Xingxin on Bug

How to Compute a Value Function? Policy Evaluation

June 29, 2026
10 min read

In a previous blog post, Value Function: From Gridworld to Bellman Equation, I used GridWorld from Sutton and Barto’s Reinforcement Learning: An Introduction as an example to introduce what the value function is. In this blog post, I want to go one step further and discuss computation: how do we actually calculate the value function?

Compared with with Sutton and Barto’s book, I found Kochenderfer’s Algorithms for Decision Making especially helpful for this topic. Maybe this is because it includes julia code, which makes the computation easier to connect with the math.

In this blog post, I’ll explain policy evaluation based on my understanding after reading the material.


What is Policy Evaluation?

The policy evaluation is the process of calculating the value function of a given (fixed) policy π\pi.

Tip

Given a policy, policy evaluation asks: if I keep following this policy, how good is each state?

Assumption

The assumptions in this post are:

I will use the reward convention from the book:

R(s,a).R(s,a).

It means the expected immediate reward after taking action aa in state ss.

Notation

I’ll follow the notation convention in ADM:

  • UU refers to utility, or value
  • RR refers to reward function
  • TT refers to state-transition probability function
  • γ\gamma refers to the discount factor
Remark

In this post, I mainly use a deterministic policy, so π(s)\pi(s) directly returns an action. If π\pi is stochastic, i.e.π(as)\pi(a \mid s), then we need to average over actions. For ease of reading, we assume π\pi is deterministic.

Solution 1: Iterative Approach

If the policy is executed for only one step, the utility is

U1π(s)=R(s,π(s)).U^{\pi}_1(s) = R(s,\pi(s)).

This is easy to understand. If the agent moves only 1 step, we only care about the immediate reward. Given the current state ss, the policy chooses action π(s)\pi(s), and the reward is R(s,π(s)).R(s,\pi(s)).


For more steps, we can use the recursive update:

Uk+1π(s)=R(s,π(s))immediate reward+γsT(ss,π(s))Ukπ(s)discounted future reward.U_{k+1}^{\pi} (s) = \underbrace{R(s,\pi(s))}_\text{immediate reward} + \underbrace{\gamma \sum_{s'} T(s' | s,\pi(s)) U_{k}^{\pi}(s')}_\text{discounted future reward}.

^c76ebd

This equation is called the lookahead. Its equivalent julia code is:

function lookahead(𝒫::MDP, U, s, a)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    return R(s,a) + γ*sum(T(s,a,s′)*U(s′) for s′ in 𝒮)
end
function lookahead(𝒫::MDP, U::Vector, s, a)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    return R(s,a) + γ*sum(T(s,a,s′)*U[i] for (i,s′) in enumerate(𝒮))
end
Remark

Treating U as a function is closer to the mathematical definition. Treating U as a vector is easier to compute in discrete, finite environments like GridWorld.

In short,

  • UU-as-function is math-friendly.
  • UU-as-vector is implementation-friendly.

After enough iteration, UkπU^\pi_k converges to the value function UπU^\pi due to contraction mapping(guaranteed to converge). At convergence, the old value and the new value are the same:

Uπ(s)=R(s,π(s))+γsT(ss,π(s))Uπ(s).U^{\pi} (s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').

This equation is called Bellman expectation equation, and its equivalent code is

function iterative_policy_evaluation(𝒫::MDP, π, k_max)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    U = [0.0 for s in 𝒮]
    for k in 1:k_max
        U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮]
    end
    return U
end
Remark

In many literature, the Bellman equation is loosely called. Depending on the context, it might be referred to

Remark

I want to add some notes on contraction mapping. A function ff is a contraction mapping if d(f(x),f(y))kd(x,y)d(f(x), f(y)) \le k \cdot d(x, y) where 0k<10\leq k<1.

This means that the distance between the 2 outputs f(x)f(x) and f(y)f(y) is smaller than the distance between the 2 inputs xx and yy.

Intuitively, if we repeatedly apply a contraction mapping f(f(f(x)))f(f(f(x)))\dots, the result is guaranteed to move toward a fixed point where f(x)=x.f(x)=x.

In the code U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮], the old value U goes into the Bellman update, and the new value U comes out. Repeating this process moves U closer to the true value function.

Solution 2: System of Equations

ADM points out that policy evaluation can also be done without iteration by solving the Bellman expectation equation directly as system of equations. The matrix form is:

Uπ=Rπ+γTπUπ.\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}.

First, the vector Uπ\mathbf{U}^{\pi} contains the value of every state, we have S|\mathcal{S}| states. If the states are s1,,sns_1,\dots,s_n, then:

Uπ=[Uπ(s1)Uπ(s2)Uπ(sn)].\mathbf{U}^{\pi} = \begin{bmatrix} U^{\pi}(s_1) \\ U^{\pi}(s_2) \\ \vdots \\ U^{\pi}(s_n) \end{bmatrix}.
Remark

The S|\mathcal{S}| denotes the cardinal number of the state set. In this case, S=n|\mathcal{S}|=n.


The vector Rπ\mathbf{R}^{\pi} is the reward vector under policy π\pi:

Rπ=[R(s1,π(s1))R(s2,π(s2))R(sn,π(sn))].\mathbf{R}^{\pi} = \begin{bmatrix} R(s_1, \pi(s_1)) \\ R(s_2, \pi(s_2)) \\ \vdots \\ R(s_n, \pi(s_n)) \end{bmatrix}.
Remark

Because the policy is fixed, each π(si)\pi(s_i) gives one specific action.


The matrix Tπ\mathbf{T}^\pi is an n×nn \times n square matrix. Its entry in row ii and column jj is:

Tijπ=T(sjsi,π(si)).T_{ij}^\pi=T(s_j \mid s_i,\pi(s_i)).

This is the probability of moving from state sis_i​ to state sjs_j when following policy π\pi.

Remark

Note that the Tπ\mathbf{T}^\pi is a stochastic matrix. Each row is a probability distribution, so each row sums to 1.

Remark

If the policy is stochastic, then the same matrix form still works. We only need to define: Riπ=aπ(asi)R(si,a),R_i^\pi​=\sum_a \pi(a\mid s_i​)R(s_i​,a), and Tijπ=aπ(asi)T(sjsi,a).T_{ij}^\pi = \sum_a \pi(a \mid s_i) T(s_j \mid s_i, a).


Now all notation is ready. Let’s unpack the matrix equation:

Uπ=Rπ+γTπUπ.\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}.

Recall the scalar Bellman expectation equation:

Uπ(s)=R(s,π(s))+γsT(ss,π(s))Uπ(s).U^{\pi} (s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').

The matrix-vector multiplication TπUπ\mathbf{T}^{\pi} \mathbf{U}^{\pi} is exactly the vector form of:

sT(ss,π(s))Uπ(s).\sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').

Suppose the diagram uses 3 states, i.e., S=3|\mathcal{S}|=3:

S={s0,s1,s2}.S=\Set{s_0​,s_1​,s_2​}.

Then:

Uπ=[U0U1U2],\mathbf{U}^\pi= \begin{bmatrix}U_0 \\ U_1 \\ U_2\end{bmatrix},

where Ui=Uπ(si)U_i = U^\pi(s_i).

The transition matrix Tπ\mathbf{T}^\pi is a 3×33\times3 square matrix:

Tπ=[Ts0s0Ts0s1Ts0s2Ts1s0Ts1s1Ts1s2Ts2s0Ts2s1Ts2s2].\mathbf{T}^{\pi}= \begin{bmatrix} T_{s_0s_0} & T_{s_0s_1} & T_{s_0s_2}\\ T_{s_1s_0} & T_{s_1s_1} & T_{s_1s_2}\\ T_{s_2s_0} & T_{s_2s_1} & T_{s_2s_2} \end{bmatrix}.

Here, Tsisj=T(sjsi,π(si)).T_{s_is_j}=T(s_j\mid s_i,\pi(s_i)).

The symbol s\displaystyle \sum_{s'} means we enumerates all possible next states. This is why the matrix-vector product uses a 33-vector. For the row corresponding to current state s1s_1, we get

(TπUπ)s1=Ts1s0U0+Ts1s1U1+Ts1s2U2. (\mathbf{T}^{\pi}\mathbf{U}^{\pi})_{s_1} = T_{s_1s_0}U_0 + T_{s_1s_1}U_1 + T_{s_1s_2}U_2.

Therefore

U(s1)=R(s1,π(s1))+γ(Ts1s0U0+Ts1s1U1+Ts1s2U2)shown in diagram.U(s_1) = R(s_1,\pi(s_1)) + \gamma \underbrace{\Big( T_{s_1 s_0} U_0 + T_{s_1 s_1} U_1 + T_{s_1 s_2} U_2 \Big)}_{\text{shown in diagram}}.

Using Strang’s favorite “column picture”, we can have the following diagrams help us understand the matrix-vector product better.

utility-transition-matrix-product-example.svg

So the matrix equation

Uπ=Rπ+γTπUπ\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}

is simply the vectorized form of

Uπ(s)=R(s,π(s))+γsT(ss,π(s))Uπ(s).U^{\pi} (s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').

Now, how do we solve it?

Start from:

Uπ=Rπ+γTπUπ\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}

Rearrange to isolate Uπ\mathbf{U}^{\pi}:

UπγTπUπ=Rπ.\mathbf{U}^{\pi} - \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi} = \mathbf{R}^{\pi}.

Factor out Uπ\mathbf{U}^{\pi} (I\mathbf{I} is the n×nn \times n identity matrix):

(IγTπ)Uπ=Rπ.(\mathbf{I} - \gamma \mathbf{T}^{\pi}) \mathbf{U}^{\pi} = \mathbf{R}^{\pi}.

Finally, multiply both sides by the inverse:

Uπ=(IγTπ)1Rπ.\mathbf{U}^{\pi} = (\mathbf{I} - \gamma \mathbf{T}^{\pi})^{-1} \mathbf{R}^{\pi}.

In code, however, we usually should not compute the inverse directly. It is better to solve the linear system:

function policy_evaluation(𝒫::MDP, π)
	𝒮, R, T, γ = 𝒫.𝒮, 𝒫.R, 𝒫.T, 𝒫.γ
	R′ = [R(s, π(s)) for s in 𝒮]
	T′ = [T(s, π(s), s′) for s in 𝒮, s′ in 𝒮]
	return (I - γ*T′)\R′
end
Remark

The backslash operator \ solves the linear system directly.

Supplement Question

Now we know the “what”. But let’s ask “why”.

Why we can write:

Uπ=Rπ+γTπUπ?\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}?

At first, this may feel confusing because Uπ\mathbf{U}^{\pi} appears on both sides. But in the iterative code, we have:

function iterative_policy_evaluation(𝒫::MDP, π, k_max)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    U = [0.0 for s in 𝒮]
    for k in 1:k_max
        U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮]
    end
    return U
end

So it looks like there are 2 different ideas:

  1. iterative update: old UU goes in, new UU comes out
  2. matrix equation: the same UπU^\pi appears on both sides

How should we understand this?🤔


The key idea is that the matrix equation describes the fixed point of the iterative process.

The iterative method is

Uk+1=Rπ+γTπUk.\mathbf{U}_{k+1} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}_k.

Define the Bellman expectation operator Bπ\mathcal{B}_{\pi} as:

Bπ(U)=Rπ+γTπU.\mathcal{B}_{\pi}(\mathbf{U}) = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}.

At convergence, the value no longer changes:

Uk+1=Uk=Uπ.\mathbf{U}_{k+1} = \mathbf{U}_k = \mathbf{U}^{\pi}.

So the fixed-point equation becomes

Uπ=Bπ(Uπ),\mathbf{U}^\pi = \mathcal{B}_{\pi}(\mathbf{U}^\pi),

which is exactly

Uπ=Rπ+γTπUπ.\mathbf{U}^\pi = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^\pi.
Tip

Bπ\mathcal{B}_{\pi} is a contraction mapping in the \ell_\infty norm with Lipschitz constant γ<1\gamma < 1.

More explicitly:

Bπ(U)Bπ(V)    γUV.\|\mathcal{B}_{\pi}(\mathbf{U}) - \mathcal{B}_{\pi}(\mathbf{V})\|_\infty \;\leq\; \gamma \|\mathbf{U} - \mathbf{V}\|_\infty.

Why?

Bπ(U)Bπ(V)=γTπ(UV)  γUV.\begin{align} \|\mathcal{B}_{\pi}(\mathbf{U}) - \mathcal{B}_{\pi}(\mathbf{V})\|_\infty &= \| \gamma \mathbf{T}^\pi (\mathbf{U} - \mathbf{V}) \| _\infty \\ &\leq\; \gamma \|\mathbf{U} - \mathbf{V}\|_\infty . \end{align}

The inequality holds because each row of Tπ\mathbf{T}^\pi is a probability distribution.

Remark

By the Banach fixed-point theorem, the fixed point exists, is unique, and the iterative methods converges to it.


Now let’s connect this with the matrix solution.

Because Tπ\mathbf{T}^\pi is stochastic matrix, its eigenvalues have magnitude at most 1. Since 0γ<10\leq\gamma<1, the eigenvalues of γTπ\gamma\mathbf{T}^\pi have magnitude strictly less than 1.

Therefore, the Neumann series converges:

(IγTπ)1=m=0(γTπ)m.(\mathbf{I}−\gamma\mathbf{T}^\pi)^{−1}=\sum_{m=0}^\infty​(\gamma\mathbf{T}^\pi)^m.

So:

Uπ=(IγTπ)1Rπ=m=0(γTπ)mRπ.\mathbf{U}^{\pi} = (\mathbf{I} - \gamma \mathbf{T}^{\pi})^{-1} \mathbf{R}^{\pi} = \sum_{m=0}^\infty​(\gamma\mathbf{T}^\pi)^m\mathbf{R}^\pi.

This also explains the iterative method.

If we start with (U0=0\mathbf{U}_0=\mathbf{0}), then:

U1=Rπ,U2=Rπ+γTπRπ,U3=Rπ+γTπRπ+(γTπ)2Rπ.\begin{align} \mathbf{U}_1 &= \mathbf{R}^{\pi}, \\ \mathbf{U}_2 &= \mathbf{R}^{\pi} + \gamma\mathbf{T}^{\pi}\mathbf{R}^{\pi}, \\ \mathbf{U}_3 &= \mathbf{R}^{\pi} + \gamma\mathbf{T}^{\pi}\mathbf{R}^{\pi} + (\gamma\mathbf{T}^{\pi})^2\mathbf{R}^{\pi}. \end{align}

After many iterations, we get:

Uπ=Rπ+γTπRπ+(γTπ)2Rπ+.\mathbf{U}^\pi = \mathbf{R}^{\pi} + \gamma\mathbf{T}^{\pi}\mathbf{R}^{\pi} + (\gamma\mathbf{T}^{\pi})^2\mathbf{R}^{\pi} + \cdots .

That is the same infinite series as the matrix inverse solution!

So the 2 methods are not contradictory:

  • iterative policy evaluation approaches the fixed point step by step
  • the linear-system method solves the fixed point directly

To conclude,

  • Iterative policy evaluation is fixed-point iteration for the Bellman expectation equation.
  • Exact policy evaluation solves the same fixed point as a linear system.

See also...