
Why the Gymnasium API Looks the Way It Does

February 2, 2026
5 min read

I recently started experimenting with Gymnasium and noticed that its API and workflow look very similar to a probabilistic graphical model. Is this just a coincidence? In this post, I map the two onto each other and compare them.

Remark

I later learned from the documentation that Gymnasium was explicitly designed to formalize reinforcement learning problems. But it is still a fun brain exercise to map the concepts myself!

Probabilistic Graphical Model

Let’s review what a graphical model looks like.

[Figure: probabilistic graphical model with two states]

First, take a look at the nodes:

  • $s$ denotes the state.
  • $o$ denotes the observation.
  • $p$ denotes the probability distribution.
  • $\pi$ denotes the policy, where $\theta$ denotes the policy’s parameters.

Next, take a look at the edges:

  • the edge from state $s_1$ to observation $o_1$ represents a stochastic process (a “sensor model”), which may or may not capture all the information in state $s_1$.
  • the edge from observation $o_1$ to action $a_1$ represents the policy.
  • the edge from state $s_1$ to the future state $s_2$ represents the transition probability, also known as the “dynamics”. (A minimal sketch of all three pieces follows this list.)
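
To make the mapping concrete, here is a minimal sketch of one pass through such a model. The two-state setup, the transition matrices, and the trivial policy are all made up purely for illustration; only the structure (sensor model, policy, dynamics) mirrors the diagram.

import numpy as np

rng = np.random.default_rng(0)

# Made-up toy model with 2 states, 2 observations, 2 actions
P_obs = np.array([[0.9, 0.1],        # p(o | s): the "sensor model"
                  [0.2, 0.8]])
P_next = np.array([[[0.7, 0.3],      # p(s' | s, a): the "dynamics"
                    [0.4, 0.6]],
                   [[0.5, 0.5],
                    [0.1, 0.9]]])

def policy(o):
    """A trivial policy pi(a | o): echo the observation back as the action."""
    return o

s1 = 0                                    # current state s_1
o1 = rng.choice(2, p=P_obs[s1])           # edge s_1 -> o_1: the sensor model
a1 = policy(o1)                           # edge o_1 -> a_1: the policy
s2 = rng.choice(2, p=P_next[s1, a1])      # edges s_1, a_1 -> s_2: the dynamics
print(s1, o1, a1, s2)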

The gymnasium Code Snippet

Let’s look at this basic example from gymnasium.

import gymnasium as gym
env = gym.make("CartPole-v1", render_mode="human")
observation, info = env.reset()
episode_over = False
total_reward = 0
while not episode_over:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    episode_over = terminated or truncated
env.close()

Even without an explanation, you can infer some reinforcement learning patterns just by looking at the variable names and the control flow:

  • action
  • observation
  • reward
  • step

Now, let’s combine this with the math and dive deeper.

To the Next State

[Figure: PGM state transition]

Let’s focus on this specific line of Python:

observation, reward, terminated, truncated, info = env.step(action)

💬 Question: We can ask a few key questions here:

  • what represents the “sampling” ($\sim$) from the distribution $p$?
  • what represents $s_{t+1}$?

🗣 Answer: The env.step() method acts as the sampler: it draws the next state according to the environment’s transition probabilities.

Identifying $s_{t+1}$ is a bit tricky. In short, the observation returned by this function corresponds to $o_{t+1}$. However, depending on the environment, this observation might effectively be the state:

  • if $s_{t+1} = o_{t+1}$, it is an MDP (fully observed).
  • if $s_{t+1} \neq o_{t+1}$, it is a POMDP (partially observed).

Remark

We will discuss the difference between MDP and POMDP more in the section The “God View” vs. The “Pixel View”.

In an environment like CartPole, the system is fully observed. We can actually verify this by comparing the observation to the internal state:

observation, reward, terminated, truncated, info = env.step(action)
print(f"observation: {observation}")
print(f"state: {env.unwrapped.state}")

# The outputs match (the tiny differences in the last digits come from the
# observation being cast to float32):
# observation: [ 0.02138549  0.61995554  0.06039844 -0.7173455 ]
# state: [ 0.02138549  0.61995552  0.06039844 -0.71734546]

The Markov Property in Code

In the graphical model, we know that the states satisfy an important property called the Markov property. It simply states that “the future state depends only on the present state”:

$\mathbf{s}_{t+1} \perp \mathbf{s}_{t-1} \mid \mathbf{s}_t$

The expression above is read as “the state $\mathbf{s}_{t+1}$ and the state $\mathbf{s}_{t-1}$ are conditionally independent given ($\mid$) the state $\mathbf{s}_t$”.

Since the Markov property holds, we can clearly see which function signature is correct:

  • env.step(action)
  • env.step(action, previous_history_list)

It is the first one: the environment only needs the current action (plus its own current state); it never needs the history.
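
As an illustration, here is a minimal custom-environment sketch. The CoinFlipEnv class and its dynamics are invented for this post (they are not part of Gymnasium); the point is simply that step() takes nothing beyond the action, because the current internal state already carries everything the transition depends on.

import gymnasium as gym


class CoinFlipEnv(gym.Env):
    """Hypothetical toy environment, made up for illustration only."""

    def __init__(self):
        self.observation_space = gym.spaces.Discrete(2)
        self.action_space = gym.spaces.Discrete(2)
        self.state = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = 0
        return self.state, {}

    def step(self, action):
        # s_{t+1} depends only on s_t and a_t -- no history argument,
        # which is exactly the Markov property.
        flip = int(self.np_random.integers(2))           # random transition noise
        self.state = (self.state + int(action) + flip) % 2
        reward = float(self.state == 1)
        return self.state, reward, False, False, {}


env = CoinFlipEnv()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())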

The “God View” vs. The “Pixel View”

Let’s use Pong as an example to illustrate the difference between POMDP and MDP.

[Figure: Pong gameplay (pong.gif)]

According to the documentation, the observation space of Pong is an image:

observation_space=Box(0, 255, (210, 160, 3), np.uint8)

Therefore, the agent cannot “understand” the full world state like it does in CartPole. In math terms, this means $o_t$ (the pixels) is not $s_t$ (the state). For example, looking at a single static frame of Pong, you cannot tell whether the ball is moving left or right. This is why we need techniques like “frame stacking” to reconstruct the velocity components of the hidden state $s_t$ from partial information.
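
Here is a rough sketch of frame stacking with Gymnasium’s built-in wrappers. It assumes the Atari extras are installed (e.g. pip install "gymnasium[atari]"), and the exact wrapper and registration names vary across Gymnasium versions (FrameStackObservation in recent releases, FrameStack in older ones).

import gymnasium as gym
import ale_py

gym.register_envs(ale_py)   # register the ALE/* environments (needed on recent versions)

env = gym.make("ALE/Pong-v5")
env = gym.wrappers.FrameStackObservation(env, 4)   # stack the last 4 frames

obs, info = env.reset()
print(obs.shape)   # (4, 210, 160, 3): four consecutive frames, enough to infer the ball's velocity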

We can compare them like this:

|                   | CartPole                                                        | Pong                       |
| ----------------- | --------------------------------------------------------------- | -------------------------- |
| observation space | Cart Position, Cart Velocity, Pole Angle, Pole Angular Velocity | RGB image                  |
| type              | fully observed (MDP)                                            | partially observed (POMDP) |

The Policy

Finally, let’s look at the expression $\pi_\theta(a_t \mid o_t)$. This represents the policy, which takes an observation as input and outputs an action.

In code, we often see two patterns (a small sketch of both follows below):

  • argmax(logits)
  • distribution.sample()

The first one corresponds to “Evaluation Mode”, while the second corresponds to “Training Mode”. This reflects the explore-exploit tradeoff: during training, you want to sample so that you explore as many different strategies as possible, but during evaluation, you want to exploit the best known action.
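
Here is a minimal sketch of the two patterns using NumPy and made-up logits; a real agent would typically get the logits from a policy network and use its framework’s distribution object instead.

import numpy as np

rng = np.random.default_rng(0)

# Made-up logits standing in for the output of a policy network pi_theta(a_t | o_t)
logits = np.array([2.0, 0.5, -1.0])

# Softmax: turn the logits into a probability distribution over actions
probs = np.exp(logits - logits.max())
probs /= probs.sum()

greedy_action = int(np.argmax(logits))                   # "Evaluation Mode": exploit the best known action
sampled_action = int(rng.choice(len(probs), p=probs))    # "Training Mode": sample to keep exploring

print(greedy_action, sampled_action)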