I recently started experimenting with Gymnasium and noticed that its API and workflow are very similar to a probabilistic graphical model. Is this just a coincidence? In this post, I map out the correspondence and the differences.
Remark
I later learned from the documentation page that `gym` was explicitly designed to formulate reinforcement learning problems. But it is still a fun brain exercise to map the concepts myself!
Probabilistic Graphical Model
Let’s review what a graphical model looks like.
First, take a look at the nodes:
- $s_t$ denotes the state.
- $o_t$ denotes the observation.
- $p$ denotes a probability distribution.
- $\pi_\theta$ denotes the policy, where $\theta$ denotes the policy’s parameters.
Next, take a look at the edges:
- the edge from the state $s_t$ to the observation $o_t$ represents a stochastic process (a “sensor model” $p(o_t \mid s_t)$), which may or may not capture all the information in the state.
- the edge from the observation $o_t$ to the action $a_t$ represents the policy $\pi_\theta(a_t \mid o_t)$.
- the edge from the state $s_t$ to the next state $s_{t+1}$ represents a transition probability $p(s_{t+1} \mid s_t, a_t)$, also known as the “dynamics”.
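Putting the nodes and edges together, the joint distribution over a trajectory factorizes according to this graph. The factorization below is my own summary in the notation above (the standard POMDP form), not something taken from the Gymnasium docs:

$$
p_\theta(s_{1:T}, o_{1:T}, a_{1:T}) = p(s_1) \prod_{t=1}^{T} \underbrace{p(o_t \mid s_t)}_{\text{sensor model}} \; \underbrace{\pi_\theta(a_t \mid o_t)}_{\text{policy}} \; \underbrace{p(s_{t+1} \mid s_t, a_t)}_{\text{dynamics}}
$$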
The `gymnasium` Code Snippet
Let’s look at this basic example from gymnasium.
```python
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="human")
observation, info = env.reset()

episode_over = False
total_reward = 0

while not episode_over:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    episode_over = terminated or truncated

env.close()
```
Even without an explanation, you can infer some reinforcement learning patterns just by looking at the variable names and control flow: `action`, `observation`, `reward`, and the `step()` call.
Now, let’s combine this with the math and dive deeper.
To the next state
Let’s focus on this specific line of Python:

```python
observation, reward, terminated, truncated, info = env.step(action)
```

💬 Question

We can ask a few key questions here:
- what represents the “sampling” ($\sim$) from the distribution $p(s_{t+1} \mid s_t, a_t)$?
- what represents $o_{t+1}$?
🗣 Answer
The `env.step()` method acts as the sampler. It decides the next state $s_{t+1}$ based on the environment’s transition probabilities.
Identifying $o_{t+1}$ is a bit tricky. In short, the `observation` returned by this function corresponds to $o_{t+1}$. However, depending on the environment, this observation might effectively be the state $s_{t+1}$:
Remark
We will discuss the difference between MDP and POMDP more in the section The “God View” vs. The “Pixel View”.
In an environment like CartPole, the system is fully observed. We can actually verify this by comparing the observation to the internal state:
```python
observation, reward, terminated, truncated, info = env.step(action)
print(f"observation: {observation}")
print(f"state: {env.unwrapped.state}")

# The output shows they are identical (up to floating-point precision):
# observation: [ 0.02138549 0.61995554 0.06039844 -0.7173455 ]
# state: [ 0.02138549 0.61995552 0.06039844 -0.71734546]
```

The Markov Property in Code
In the graphical model, we know that the states satisfy an important property called the Markov property. It simply states that “the future state depends only on the present state”:

$$
s_{t+1} \perp s_{t-1} \mid s_t
$$

The expression above is read as “the state $s_{t+1}$ and the state $s_{t-1}$ are [conditional independence|conditionally independent] given ($\mid$) the state $s_t$”. ^965f65
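Equivalently, in terms of the dynamics defined earlier (my own restatement), conditioning on any extra history does not change the transition distribution:

$$
p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = p(s_{t+1} \mid s_t, a_t)
$$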
Since the Markov property holds, we can clearly see which function signature is correct:
- ✅ `env.step(action)`
- ❌ `env.step(action, previous_history_list)`
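To make this concrete, here is a minimal sketch of a custom environment whose `step()` only reads its current internal state and the incoming action. The environment itself (a noisy 1-D random walk) is hypothetical, not something from the Gymnasium codebase:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class RandomWalkEnv(gym.Env):
    """A hypothetical 1-D random walk used to illustrate the Markov property."""

    def __init__(self):
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0: move left, 1: move right
        self._state = np.zeros(1, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._state = np.zeros(1, dtype=np.float32)
        return self._state.copy(), {}

    def step(self, action):
        # s_{t+1} ~ p(s_{t+1} | s_t, a_t): only the current state and action
        # are used; no history argument is needed anywhere.
        direction = 1.0 if action == 1 else -1.0
        noise = self.np_random.normal(0.0, 0.1)
        self._state = (self._state + direction + noise).astype(np.float32)
        terminated = bool(abs(self._state[0]) > 10.0)
        return self._state.copy(), 0.0, terminated, False, {}
```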
The “God View” vs. The “Pixel View”
Let’s use Pong as an example to illustrate the difference between POMDP and MDP.

According to the documentation, the observation space of Pong is an image:
```python
observation_space=Box(0, 255, (210, 160, 3), np.uint8)
```

Therefore, the agent cannot “understand” the full world state like it does in CartPole. In math terms, this means $o_t$ (the pixels) is not $s_t$ (the state). For example, looking at a single static image of Pong, you cannot tell whether the ball is moving left or right. This is why we need techniques like “frame stacking” to reconstruct the velocity components of the hidden state from partial information.
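As a sketch of how frame stacking looks in practice (this assumes the Atari extras are installed, and that your Gymnasium version ships the `FrameStackObservation` wrapper; older releases call it `FrameStack`):

```python
import gymnasium as gym
import ale_py  # provides the ALE/Pong-v5 environment
from gymnasium.wrappers import FrameStackObservation  # called FrameStack in older versions

gym.register_envs(ale_py)  # explicit registration, recommended on recent versions

env = gym.make("ALE/Pong-v5")
print(env.observation_space.shape)  # (210, 160, 3): a single frame, no velocity info

# Stack the last 4 frames so the observation carries motion information.
env = FrameStackObservation(env, stack_size=4)
print(env.observation_space.shape)  # (4, 210, 160, 3)

observation, info = env.reset()
```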
We can compare them like this:
|  | CartPole | Pong |
|---|---|---|
| observation space | Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity | RGB image |
| type | fully observed (MDP) | partially observed (POMDP) |
The Policy
Finally, let’s look at the expression $\pi_\theta(a_t \mid o_t)$. This represents the policy, which takes an observation as input and outputs an action.
In code, we often see two patterns:

- `argmax(logits)`
- `distribution.sample()`

The first one corresponds to “Evaluation Mode”, while the second corresponds to “Training Mode”. This reflects the explore-exploit tradeoff: during training, you want to sample to explore as many different strategies as possible, but during evaluation, you want to exploit the best known action.
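Here is a minimal sketch of both patterns using a made-up logits vector and plain NumPy (no specific RL library assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical policy output for a Discrete(3) action space.
logits = np.array([2.0, 0.5, -1.0])

# "Evaluation Mode": exploit the best known action.
greedy_action = int(np.argmax(logits))

# "Training Mode": sample from the softmax distribution to keep exploring.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
sampled_action = int(rng.choice(len(logits), p=probs))

print(greedy_action, sampled_action)
```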