In a previous blog post, Value Function: From Gridworld to Bellman Equation, I used GridWorld from Sutton and Barto’s Reinforcement Learning: An Introduction as an example to introduce what the value function is. In this blog post, I want to go one step further and discuss computation: how do we actually calculate the value function?
Compared with with Sutton and Barto’s book, I found Kochenderfer’s Algorithms for Decision Making especially helpful for this topic. Maybe this is because it includes julia code, which makes the computation easier to connect with the math.
In this blog post, I’ll explain policy evaluation based on my understanding after reading the material.
What is Policy Evaluation?
The policy evaluation is the process of calculating the value function of a given (fixed) policy .
TipGiven a policy, policy evaluation asks: if I keep following this policy, how good is each state?
Assumption
The assumptions in this post are:
- the model is known,
- the environment is fully observable. (i.e., we know the current state )
- the state space is finite and discrete,
- the action space is finite and discrete,
- the state transition model is known,
- the reward model is known,
- the policy is fixed,
- the discount factor satisfies .
I will use the reward convention from the book:
It means the expected immediate reward after taking action in state .
Notation
I’ll follow the notation convention in ADM:
- refers to utility, or value
- refers to reward function
- refers to state-transition probability function
- refers to the discount factor
RemarkIn this post, I mainly use a deterministic policy, so directly returns an action. If is stochastic, i.e., then we need to average over actions. For ease of reading, we assume is deterministic.
Solution 1: Iterative Approach
If the policy is executed for only one step, the utility is
This is easy to understand. If the agent moves only 1 step, we only care about the immediate reward. Given the current state , the policy chooses action , and the reward is
For more steps, we can use the recursive update:
^c76ebd
This equation is called the lookahead. Its equivalent julia code is:
function lookahead(𝒫::MDP, U, s, a)
𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
return R(s,a) + γ*sum(T(s,a,s′)*U(s′) for s′ in 𝒮)
end
function lookahead(𝒫::MDP, U::Vector, s, a)
𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
return R(s,a) + γ*sum(T(s,a,s′)*U[i] for (i,s′) in enumerate(𝒮))
endRemarkTreating
Uas a function is closer to the mathematical definition. TreatingUas a vector is easier to compute in discrete, finite environments like GridWorld.In short,
- -as-function is math-friendly.
- -as-vector is implementation-friendly.
After enough iteration, converges to the value function due to contraction mapping(guaranteed to converge). At convergence, the old value and the new value are the same:
This equation is called Bellman expectation equation, and its equivalent code is
function iterative_policy_evaluation(𝒫::MDP, π, k_max)
𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
U = [0.0 for s in 𝒮]
for k in 1:k_max
U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮]
end
return U
endRemarkIn many literature, the Bellman equation is loosely called. Depending on the context, it might be referred to
- Bellman expectation equation: evaluate a fixed policy . This is for policy evaluation.
- Bellman optimality equation: choose best action by taking a max over action. This is for control / optimization. ^457806
RemarkI want to add some notes on contraction mapping. A function is a contraction mapping if where .
This means that the distance between the 2 outputs and is smaller than the distance between the 2 inputs and .
Intuitively, if we repeatedly apply a contraction mapping , the result is guaranteed to move toward a fixed point where
In the code
U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮], the old valueUgoes into the Bellman update, and the new valueUcomes out. Repeating this process movesUcloser to the true value function.
Solution 2: System of Equations
ADM points out that policy evaluation can also be done without iteration by solving the Bellman expectation equation directly as system of equations. The matrix form is:
First, the vector contains the value of every state, we have states. If the states are , then:
RemarkThe denotes the cardinal number of the state set. In this case, .
The vector is the reward vector under policy :
RemarkBecause the policy is fixed, each gives one specific action.
The matrix is an square matrix. Its entry in row and column is:
This is the probability of moving from state to state when following policy .
RemarkNote that the is a stochastic matrix. Each row is a probability distribution, so each row sums to 1.
RemarkIf the policy is stochastic, then the same matrix form still works. We only need to define: and
Now all notation is ready. Let’s unpack the matrix equation:
Recall the scalar Bellman expectation equation:
The matrix-vector multiplication is exactly the vector form of:
Suppose the diagram uses 3 states, i.e., :
Then:
where .
The transition matrix is a square matrix:
Here,
The symbol means we enumerates all possible next states. This is why the matrix-vector product uses a -vector. For the row corresponding to current state , we get
Therefore
Using Strang’s favorite “column picture”, we can have the following diagrams help us understand the matrix-vector product better.
So the matrix equation
is simply the vectorized form of
Now, how do we solve it?
Start from:
Rearrange to isolate :
Factor out ( is the identity matrix):
Finally, multiply both sides by the inverse:
In code, however, we usually should not compute the inverse directly. It is better to solve the linear system:
function policy_evaluation(𝒫::MDP, π)
𝒮, R, T, γ = 𝒫.𝒮, 𝒫.R, 𝒫.T, 𝒫.γ
R′ = [R(s, π(s)) for s in 𝒮]
T′ = [T(s, π(s), s′) for s in 𝒮, s′ in 𝒮]
return (I - γ*T′)\R′
endRemarkThe backslash operator
\solves the linear system directly.
Supplement Question
Now we know the “what”. But let’s ask “why”.
Why we can write:
At first, this may feel confusing because appears on both sides. But in the iterative code, we have:
function iterative_policy_evaluation(𝒫::MDP, π, k_max)
𝒮, T, R, γ = 𝒫.𝒮, 𝒫.T, 𝒫.R, 𝒫.γ
U = [0.0 for s in 𝒮]
for k in 1:k_max
U = [lookahead(𝒫, U, s, π(s)) for s in 𝒮]
end
return U
endSo it looks like there are 2 different ideas:
- iterative update: old goes in, new comes out
- matrix equation: the same appears on both sides
How should we understand this?🤔
The key idea is that the matrix equation describes the fixed point of the iterative process.
The iterative method is
Define the Bellman expectation operator as:
At convergence, the value no longer changes:
So the fixed-point equation becomes
which is exactly
Tipis a contraction mapping in the norm with Lipschitz constant .
More explicitly:
Why?
The inequality holds because each row of is a probability distribution.
RemarkBy the Banach fixed-point theorem, the fixed point exists, is unique, and the iterative methods converges to it.
Now let’s connect this with the matrix solution.
Because is stochastic matrix, its eigenvalues have magnitude at most 1. Since , the eigenvalues of have magnitude strictly less than 1.
Therefore, the Neumann series converges:
So:
This also explains the iterative method.
If we start with (), then:
After many iterations, we get:
That is the same infinite series as the matrix inverse solution!
So the 2 methods are not contradictory:
- iterative policy evaluation approaches the fixed point step by step
- the linear-system method solves the fixed point directly
To conclude,
- Iterative policy evaluation is fixed-point iteration for the Bellman expectation equation.
- Exact policy evaluation solves the same fixed point as a linear system.