How to Compute a Value Function? Policy Evaluation

In a previous blog post, Value Function: From Gridworld to Bellman Equation, I used GridWorld from Sutton and Barto’s Reinforcement Learning: An Introduction as an example to introduce what the value function is. In this blog post, I want to go one step further and discuss computation: how do we actually calculate the value function?

Compared with Sutton and Barto’s book, I found Kochenderfer’s Algorithms for Decision Making (ADM) especially helpful for this topic. Maybe this is because it includes Julia code, which makes the computation easier to connect with the math.

In this blog post, I’ll explain policy evaluation based on my understanding after reading the material.

What is Policy Evaluation?

Policy evaluation is the process of computing the value function of a given (fixed) policy $\pi$ .

Tip

Given a policy, policy evaluation asks: if I keep following this policy, how good is each state?

Assumption

The assumptions in this post are:

the model is known,
the environment is fully observable (i.e., we know the current state $s$ ),
the state space is finite and discrete,
the action space is finite and discrete,
the state transition model $T$ is known,
the reward model $R$ is known,
the policy $\pi$ is fixed,
the discount factor satisfies $0\leq\gamma<1$ .

I will use the reward convention from Algorithms for Decision Making:

R(s,a).

It means the expected immediate reward after taking action $a$ in state $s$ .

Notation

I’ll follow the notation convention in ADM:

$U$ refers to utility, or value
$R$ refers to reward function
$T$ refers to state-transition probability function
$\gamma$ refers to the discount factor

Remark

For ease of reading, I assume a deterministic policy in this post, so $\pi(s)$ directly returns an action. If $\pi$ is stochastic, i.e. $\pi(a \mid s)$ , then we need to average over actions.

Solution 1: Iterative Approach

If the policy is executed for only one step, the utility is

U^{\pi}_1(s) = R(s,\pi(s)).

This is easy to understand. If the agent moves only 1 step, we only care about the immediate reward. Given the current state $s$ , the policy chooses action $\pi(s)$ , and the reward is $R(s,\pi(s)).$

For more steps, we can use the recursive update:

U_{k+1}^{\pi} (s) = \underbrace{R(s,\pi(s))}_\text{immediate reward} + \underbrace{\gamma \sum_{s'} T(s' | s,\pi(s)) U_{k}^{\pi}(s')}_\text{discounted future reward}.

This equation is called the lookahead equation. Its equivalent Julia code is:

function lookahead(𝒫::MDP, U, s, a)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    return R(s,a) + γ*sum(T(s,a,s′)*U(s′) for s′ in 𝒮)
end
function lookahead(𝒫::MDP, U::Vector, s, a)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    return R(s,a) + γ*sum(T(s,a,s′)*U[i] for (i,s′) in enumerate(𝒮))
end

Remark

Treating U as a function is closer to the mathematical definition. Treating U as a vector is easier to compute in discrete, finite environments like GridWorld.
In short,

$U$ -as-function is math-friendly.

$U$ -as-vector is implementation-friendly.
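To make the vector version of `lookahead` concrete, here is a minimal sketch. The `MDP` struct is my own stand-in for the type used in ADM (only the fields read above are included), and the two-state chain with the helper names `trans` and `reward` is a made-up example, not something from the book.

```julia
# Hypothetical stand-in for ADM's MDP type, with only the fields used here.
struct MDP
    γ  # discount factor
    𝒮  # state space
    T  # transition function T(s, a, s′)
    R  # reward function R(s, a)
end

# Vector version of lookahead, repeated so the sketch runs on its own.
function lookahead(𝒫::MDP, U::Vector, s, a)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    return R(s,a) + γ*sum(T(s,a,s′)*U[i] for (i,s′) in enumerate(𝒮))
end

# Made-up two-state chain: whatever the action, the agent stays put with
# probability 0.9 and switches state with probability 0.1.
trans(s, a, s′) = s == s′ ? 0.9 : 0.1
reward(s, a) = s == 1 ? 1.0 : 0.0
𝒫 = MDP(0.5, [1, 2], trans, reward)

U = [10.0, 20.0]                 # an arbitrary current estimate
u1 = lookahead(𝒫, U, 1, :stay)   # 1.0 + 0.5*(0.9*10 + 0.1*20)
u2 = lookahead(𝒫, U, 2, :stay)   # 0.0 + 0.5*(0.1*10 + 0.9*20)
```

Each call is one application of the lookahead equation for a single state: the immediate reward plus the discounted, probability-weighted average of the current estimates of the next states.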

After enough iterations, $U^\pi_k$ converges to the value function $U^\pi$ , because the update is a contraction mapping (so convergence is guaranteed). At convergence, the old value and the new value are the same:

U^{\pi} (s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').

This equation is called the Bellman expectation equation. The iterative procedure that converges to its solution can be written as:

function iterative_policy_evaluation(𝒫::MDP, π, k_max)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    U = [0.0 for s in 𝒮]
    for k in 1:k_max
        U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮]
    end
    return U
end
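The function above runs for a fixed number of sweeps `k_max`. A common variation, sketched below, stops as soon as the largest per-state change drops below a tolerance. The name `evaluate_until_converged`, the `tol` keyword, the `MDP` struct, and the two-state chain are all my own assumptions for illustration; they are not from ADM.

```julia
# Hypothetical stand-in for ADM's MDP type.
struct MDP
    γ
    𝒮
    T
    R
end

function lookahead(𝒫::MDP, U::Vector, s, a)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    return R(s,a) + γ*sum(T(s,a,s′)*U[i] for (i,s′) in enumerate(𝒮))
end

# Same Bellman update as before, but stop once the values barely change.
function evaluate_until_converged(𝒫::MDP, π; tol=1e-10, k_max=10_000)
    U = [0.0 for s in 𝒫.𝒮]
    for k in 1:k_max
        U′ = [lookahead(𝒫, U, s, π(s)) for s in 𝒫.𝒮]
        Δ = maximum(abs.(U′ .- U))   # largest per-state change in this sweep
        U = U′
        if Δ < tol
            return U, k
        end
    end
    return U, k_max
end

# Made-up two-state chain: stay with probability 0.9, switch with 0.1.
trans(s, a, s′) = s == s′ ? 0.9 : 0.1
reward(s, a) = s == 1 ? 1.0 : 0.0
𝒫 = MDP(0.5, [1, 2], trans, reward)

U, k = evaluate_until_converged(𝒫, s -> :stay)
```

For this small chain the Bellman expectation equation can be solved by hand, giving $U^\pi(s_1)=11/6$ and $U^\pi(s_2)=1/6$ , so the returned vector should be close to those values.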

Remark

In the literature, the term “Bellman equation” is used loosely. Depending on the context, it may refer to either of the following:

Bellman expectation equation: evaluate a fixed policy $\pi$ . This is for policy evaluation.

Bellman optimality equation: choose the best action by taking a max over actions. This is for control / optimization.

Remark

I want to add some notes on contraction mappings. A function $f$ is a contraction mapping if $d(f(x), f(y)) \le k \cdot d(x, y)$ for all $x, y$ , where $0\leq k<1$ .
This means that the distance between the 2 outputs $f(x)$ and $f(y)$ is at most $k$ times the distance between the 2 inputs $x$ and $y$ .
Intuitively, if we repeatedly apply a contraction mapping, $f(f(f(x)))\dots$ , the result is guaranteed to move toward the unique fixed point where $f(x)=x.$
In the code U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮], the old value U goes into the Bellman update, and the new value U comes out. Repeating this process moves U closer to the true value function.
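A one-dimensional toy example may help build intuition. The map $f(x)=0.5x+1$ below is my own illustration (not from ADM): it is a contraction with $k=0.5$ , and its fixed point solves $x=0.5x+1$ , i.e. $x=2$ .

```julia
# f is a contraction on the real line: |f(x) - f(y)| = 0.5 * |x - y|.
f(x) = 0.5x + 1

# Apply f repeatedly, n times, starting from x0 (helper name is my own).
function iterate_map(f, x0, n)
    x = x0
    for _ in 1:n
        x = f(x)
    end
    return x
end

x_from_zero = iterate_map(f, 0.0, 60)
x_from_far  = iterate_map(f, 1000.0, 60)
```

Both starting points end up at (numerically) the same value, which is the point of the fixed-point argument: the starting guess does not matter, only the contraction does.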

Solution 2: System of Equations

ADM points out that policy evaluation can also be done without iteration by solving the Bellman expectation equation directly as a system of linear equations. The matrix form is:

\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}.

First, the vector $\mathbf{U}^{\pi}$ contains the value of every state, so it has $|\mathcal{S}|$ entries. If the states are $s_1,\dots,s_n$ , then:

\mathbf{U}^{\pi} = \begin{bmatrix} U^{\pi}(s_1) \\ U^{\pi}(s_2) \\ \vdots \\ U^{\pi}(s_n) \end{bmatrix}.

Remark

$|\mathcal{S}|$ denotes the cardinality of the state set, i.e. the number of states. In this case, $|\mathcal{S}|=n$ .

The vector $\mathbf{R}^{\pi}$ is the reward vector under policy $\pi$ :

\mathbf{R}^{\pi} = \begin{bmatrix} R(s_1, \pi(s_1)) \\ R(s_2, \pi(s_2)) \\ \vdots \\ R(s_n, \pi(s_n)) \end{bmatrix}.

Remark

Because the policy is fixed, each $\pi(s_i)$ gives one specific action.

The matrix $\mathbf{T}^\pi$ is an $n \times n$ square matrix. Its entry in row $i$ and column $j$ is:

T_{ij}^\pi=T(s_j \mid s_i,\pi(s_i)).

This is the probability of moving from state $s_i$ to state $s_j$ when following policy $\pi$ .

Remark

Note that $\mathbf{T}^\pi$ is a (row-)stochastic matrix: each row is a probability distribution, so each row sums to 1.

Remark

If the policy is stochastic, then the same matrix form still works. We only need to define: $R_i^\pi=\sum_a \pi(a\mid s_i)R(s_i,a),$ and $T_{ij}^\pi = \sum_a \pi(a \mid s_i) T(s_j \mid s_i, a).$
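As a minimal sketch of the stochastic-policy case, the snippet below builds $\mathbf{R}^\pi$ and $\mathbf{T}^\pi$ by averaging over actions. The array layout (`R[i, a]`, `T[i, a, j]`, `P[i, a]`) and all the numbers are assumptions I made up for illustration.

```julia
# Made-up MDP with 2 states and 2 actions, stored as arrays:
# R[i, a] is R(s_i, a) and T[i, a, j] is T(s_j | s_i, a).
R = [1.0 0.0;
     0.0 2.0]
T = zeros(2, 2, 2)
T[1, 1, :] .= [0.9, 0.1]
T[1, 2, :] .= [0.2, 0.8]
T[2, 1, :] .= [0.5, 0.5]
T[2, 2, :] .= [0.0, 1.0]

# Stochastic policy: P[i, a] is π(a | s_i); each row sums to 1.
P = [0.5 0.5;
     1.0 0.0]

n, m = size(R)   # number of states, number of actions

# Average rewards and transitions over the actions chosen by the policy.
Rπ = [sum(P[i, a] * R[i, a] for a in 1:m) for i in 1:n]
Tπ = [sum(P[i, a] * T[i, a, j] for a in 1:m) for i in 1:n, j in 1:n]
```

Because every `T[i, a, :]` and every row of `P` is a probability distribution, each row of `Tπ` is again a probability distribution, so the stochastic-matrix property carries over.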

Now all notation is ready. Let’s unpack the matrix equation:

\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}.

Recall the scalar Bellman expectation equation:

U^{\pi} (s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').

The matrix-vector multiplication $\mathbf{T}^{\pi} \mathbf{U}^{\pi}$ is exactly the vector form of:

\sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').

Suppose the diagram uses 3 states, i.e., $|\mathcal{S}|=3$ :

\mathcal{S}=\{s_0, s_1, s_2\}.

Then:

\mathbf{U}^\pi= \begin{bmatrix}U_0 \\ U_1 \\ U_2\end{bmatrix},

where $U_i = U^\pi(s_i)$ .

The transition matrix $\mathbf{T}^\pi$ is a $3\times3$ square matrix:

\mathbf{T}^{\pi}= \begin{bmatrix} T_{s_0s_0} & T_{s_0s_1} & T_{s_0s_2}\\ T_{s_1s_0} & T_{s_1s_1} & T_{s_1s_2}\\ T_{s_2s_0} & T_{s_2s_1} & T_{s_2s_2} \end{bmatrix}.

Here, $T_{s_is_j}=T(s_j\mid s_i,\pi(s_i)).$

The symbol $\displaystyle \sum_{s'}$ means we enumerate all possible next states. This is why the matrix-vector product uses a $3$ -vector: it has one entry per possible next state. For the row corresponding to current state $s_1$ , we get

(\mathbf{T}^{\pi}\mathbf{U}^{\pi})_{s_1} = T_{s_1s_0}U_0 + T_{s_1s_1}U_1 + T_{s_1s_2}U_2.

Therefore

U^{\pi}(s_1) = R(s_1,\pi(s_1)) + \gamma \underbrace{\Big( T_{s_1 s_0} U_0 + T_{s_1 s_1} U_1 + T_{s_1 s_2} U_2 \Big)}_{\text{shown in diagram}}.

Using Strang’s favorite “column picture”, the following diagrams help us understand the matrix-vector product better.

So the matrix equation

\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}

is simply the vectorized form of

U^{\pi} (s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s' | s,\pi(s)) U^{\pi}(s').
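To check numerically that the two forms agree, here is a small sketch with 3 states. The matrix `Tπ`, the vectors `Rπ` and `U`, and the value of `γ` are made-up numbers. Julia arrays are 1-indexed, so index 2 corresponds to state $s_1$ .

```julia
# Made-up 3-state example; each row of Tπ sums to 1.
Tπ = [0.5 0.5 0.0;
      0.1 0.6 0.3;
      0.0 0.2 0.8]
Rπ = [1.0, 0.0, 2.0]
U  = [1.0, 2.0, 3.0]
γ  = 0.9

# Scalar form: one explicit sum over next states per current state.
scalar = [Rπ[i] + γ * sum(Tπ[i, j] * U[j] for j in 1:3) for i in 1:3]

# Vectorized form: a single matrix-vector product.
vectorized = Rπ + γ * Tπ * U
```

For the row of $s_1$ , the inner sum is $0.1\cdot 1 + 0.6\cdot 2 + 0.3\cdot 3 = 2.2$ , so both forms give $0 + 0.9 \cdot 2.2 = 1.98$ for that state.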

Now, how do we solve it?

Start from:

\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}

Rearrange to isolate $\mathbf{U}^{\pi}$ :

$\mathbf{U}^{\pi} - \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi} = \mathbf{R}^{\pi}.$

Factor out $\mathbf{U}^{\pi}$ ( $\mathbf{I}$ is the $n \times n$ identity matrix):

$(\mathbf{I} - \gamma \mathbf{T}^{\pi}) \mathbf{U}^{\pi} = \mathbf{R}^{\pi}.$

Finally, multiply both sides by the inverse:

$\mathbf{U}^{\pi} = (\mathbf{I} - \gamma \mathbf{T}^{\pi})^{-1} \mathbf{R}^{\pi}.$

In code, however, we usually should not compute the inverse directly. It is better to solve the linear system:

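using LinearAlgebra  # provides the identity I

function policy_evaluation(𝒫::MDP, π)
    𝒮, R, T, γ = 𝒫.𝒮, 𝒫.R, 𝒫.T, 𝒫.γ
    R′ = [R(s, π(s)) for s in 𝒮]               # reward vector under π
    T′ = [T(s, π(s), s′) for s in 𝒮, s′ in 𝒮]   # transition matrix under π
    return (I - γ*T′)\R′                        # solve (I - γT′)U = R′
end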

Remark

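The backslash operator \ solves the linear system $(\mathbf{I} - \gamma \mathbf{T}^{\pi}) \mathbf{U}^{\pi} = \mathbf{R}^{\pi}$ directly, without forming the inverse explicitly. This is generally cheaper and numerically more stable than computing the inverse and multiplying.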
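As a quick sanity check, the sketch below solves a made-up 3-state system both ways and verifies that the result satisfies the Bellman expectation equation. The numbers are assumptions for illustration; `inv` is used only for comparison.

```julia
using LinearAlgebra

# Made-up 3-state example; each row of Tπ sums to 1.
Tπ = [0.5 0.5 0.0;
      0.1 0.6 0.3;
      0.0 0.2 0.8]
Rπ = [1.0, 0.0, 2.0]
γ  = 0.9

A = I - γ * Tπ

U_solve = A \ Rπ        # linear solve (preferred)
U_inv   = inv(A) * Rπ   # explicit inverse, shown only for comparison

# How far U_solve is from satisfying U = Rπ + γ Tπ U, in the max norm.
residual = norm(U_solve - (Rπ + γ * Tπ * U_solve), Inf)
```

On a problem this small the two results agree to rounding error; the difference in cost and accuracy only starts to matter for larger or badly conditioned systems.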

Supplement Question

Now we know the “what”. But let’s ask “why”.

Why can we write:

\mathbf{U}^{\pi} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^{\pi}?

At first, this may feel confusing because $\mathbf{U}^{\pi}$ appears on both sides. But in the iterative code, we have:

function iterative_policy_evaluation(𝒫::MDP, π, k_max)
    𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
    U = [0.0 for s in 𝒮]
    for k in 1:k_max
        U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮]
    end
    return U
end

So it looks like there are 2 different ideas:

iterative update: old $U$ goes in, new $U$ comes out
matrix equation: the same $U^\pi$ appears on both sides

How should we understand this?🤔

The key idea is that the matrix equation describes the fixed point of the iterative process.

The iterative method is

\mathbf{U}_{k+1} = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}_k.

Define the Bellman expectation operator $\mathcal{B}_{\pi}$ as:

\mathcal{B}_{\pi}(\mathbf{U}) = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}.

At convergence, the value no longer changes:

\mathbf{U}_{k+1} = \mathbf{U}_k = \mathbf{U}^{\pi}.

So the fixed-point equation becomes

\mathbf{U}^\pi = \mathcal{B}_{\pi}(\mathbf{U}^\pi),

which is exactly

\mathbf{U}^\pi = \mathbf{R}^{\pi} + \gamma \mathbf{T}^{\pi} \mathbf{U}^\pi.

Tip

$\mathcal{B}_{\pi}$ is a contraction mapping in the $\ell_\infty$ norm with Lipschitz constant $\gamma < 1$ .

More explicitly:

$\|\mathcal{B}_{\pi}(\mathbf{U}) - \mathcal{B}_{\pi}(\mathbf{V})\|_\infty \;\leq\; \gamma \|\mathbf{U} - \mathbf{V}\|_\infty.$

Why?

\begin{align} \|\mathcal{B}_{\pi}(\mathbf{U}) - \mathcal{B}_{\pi}(\mathbf{V})\|_\infty &= \| \gamma \mathbf{T}^\pi (\mathbf{U} - \mathbf{V}) \| _\infty \\ &\leq\; \gamma \|\mathbf{U} - \mathbf{V}\|_\infty . \end{align}

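The inequality holds because each row of $\mathbf{T}^\pi$ is a probability distribution: every entry of $\mathbf{T}^\pi(\mathbf{U}-\mathbf{V})$ is a weighted average of the entries of $\mathbf{U}-\mathbf{V}$ , so its magnitude cannot exceed the largest entry magnitude $\|\mathbf{U}-\mathbf{V}\|_\infty$ .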
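The inequality can also be checked numerically. The sketch below uses a made-up 3-state $\mathbf{T}^\pi$ and two arbitrary vectors; the function name `B` is my own shorthand for $\mathcal{B}_\pi$ .

```julia
# Made-up 3-state example; each row of Tπ sums to 1.
Tπ = [0.5 0.5 0.0;
      0.1 0.6 0.3;
      0.0 0.2 0.8]
Rπ = [1.0, 0.0, 2.0]
γ  = 0.9

# Bellman expectation operator for this fixed policy.
B(U) = Rπ + γ * Tπ * U

U = [10.0, -4.0, 7.0]
V = [0.0, 3.0, -2.0]

# maximum(abs, x) is the ℓ∞ norm of the vector x.
lhs = maximum(abs, B(U) - B(V))   # distance between the outputs
rhs = γ * maximum(abs, U - V)     # γ times the distance between the inputs
```

Here $\|\mathbf{U}-\mathbf{V}\|_\infty = 10$ , so the right-hand side is $9$ , while the left-hand side works out to $0.9 \cdot 5.8 = 5.22$ . A single numeric example is of course not a proof, just a check that the bound behaves as claimed.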

Remark

By the Banach fixed-point theorem, the fixed point exists and is unique, and the iterative method converges to it.

Now let’s connect this with the matrix solution.

Because $\mathbf{T}^\pi$ is a stochastic matrix, its eigenvalues have magnitude at most 1. Since $0\leq\gamma<1$ , the eigenvalues of $\gamma\mathbf{T}^\pi$ have magnitude strictly less than 1.

Therefore, the Neumann series converges:

(\mathbf{I} - \gamma\mathbf{T}^\pi)^{-1} = \sum_{m=0}^\infty (\gamma\mathbf{T}^\pi)^m.

So:

\mathbf{U}^{\pi} = (\mathbf{I} - \gamma \mathbf{T}^{\pi})^{-1} \mathbf{R}^{\pi} = \sum_{m=0}^\infty (\gamma\mathbf{T}^\pi)^m\mathbf{R}^\pi.

This also explains the iterative method.

If we start with $\mathbf{U}_0=\mathbf{0}$ , then:

\begin{align} \mathbf{U}_1 &= \mathbf{R}^{\pi}, \\ \mathbf{U}_2 &= \mathbf{R}^{\pi} + \gamma\mathbf{T}^{\pi}\mathbf{R}^{\pi}, \\ \mathbf{U}_3 &= \mathbf{R}^{\pi} + \gamma\mathbf{T}^{\pi}\mathbf{R}^{\pi} + (\gamma\mathbf{T}^{\pi})^2\mathbf{R}^{\pi}. \end{align}

After many iterations, we get:

\mathbf{U}^\pi = \mathbf{R}^{\pi} + \gamma\mathbf{T}^{\pi}\mathbf{R}^{\pi} + (\gamma\mathbf{T}^{\pi})^2\mathbf{R}^{\pi} + \cdots .

That is the same infinite series as the matrix inverse solution!
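Here is a minimal sketch of that equivalence on a made-up 3-state example. The helper names `neumann_partial_sum` and `iterate_from_zero` are my own. The first adds up the first $K$ terms of the series, the second runs $K$ Bellman updates starting from zero.

```julia
using LinearAlgebra

# Sum of the first K terms (γTπ)^m * Rπ, for m = 0, ..., K-1.
function neumann_partial_sum(Tπ, Rπ, γ, K)
    total = zero(Rπ)
    term = copy(Rπ)             # the m = 0 term
    for _ in 1:K
        total += term
        term = γ * Tπ * term    # next term of the series
    end
    return total
end

# K Bellman expectation updates starting from U_0 = 0.
function iterate_from_zero(Tπ, Rπ, γ, K)
    U = zero(Rπ)
    for _ in 1:K
        U = Rπ + γ * Tπ * U
    end
    return U
end

# Made-up 3-state example; each row of Tπ sums to 1.
Tπ = [0.5 0.5 0.0;
      0.1 0.6 0.3;
      0.0 0.2 0.8]
Rπ = [1.0, 0.0, 2.0]
γ  = 0.9

U_exact  = (I - γ * Tπ) \ Rπ
U_series = neumann_partial_sum(Tπ, Rπ, γ, 500)
```

For any $K$ , the two helpers return the same vector (up to rounding), and with enough terms the partial sum matches the linear-system solution.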

So the 2 methods are not contradictory:

iterative policy evaluation approaches the fixed point step by step
the linear-system method solves the fixed point directly

To conclude,

Iterative policy evaluation is fixed-point iteration for the Bellman expectation equation.
Exact policy evaluation solves the same fixed point as a linear system.
