Xingxin on Bug

Understanding Action Chunking with Flow Matching

February 22, 2026
12 min read

As part of Homework 1 (Imitation Learning) for CS 185/285: Deep Reinforcement Learning, and my research in robot learning, I needed to get up to speed quickly on the basics of Flow Matching. In my previous blog post, A Beginner’s Guide to Flow Matching, I explained the core intuition behind the topic.

This post dives deep into applying flow matching to action chunking, specifically geared toward those wanting to understand the mechanics behind modern robot imitation learning. To respect academic integrity, I won’t release my source code. Instead, I’ll explain the architecture and the math from a big-picture perspective.

What is Action Chunking?

Let’s first review the concept of action chunking, popularized by research like 📄Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT). In standard imitation learning, the policy is defined as:

$$\mathbf{a}_t \sim \pi_{\theta}(\cdot \mid \mathbf{o}_t).$$

The policy $\pi_{\theta}$ represents a probability distribution over a single action $\mathbf{a}_t$ conditioned on the current observation $\mathbf{o}_t$. Instead of predicting just one action $\mathbf{a}_t$, action chunking proposes predicting a sequence, or chunk, of future actions all at once:

$$\mathbf{A}_t \sim \pi_{\theta}(\cdot \mid \mathbf{o}_t),$$

where $\mathbf{A}_t=\{\mathbf{a}_t, \mathbf{a}_{t+1},\mathbf{a}_{t+2},\dots,\mathbf{a}_{t+K-1}\}$ is a fixed $K$-length sequence of actions.

Remark

This creates an open-loop execution phase: the environment receives the action $\mathbf{a}_t$ at time $t$, then $\mathbf{a}_{t+1}$ at time $t+1$, and so on through $\mathbf{a}_{t+K-1}$. The agent only queries a new observation at time $t+K$.
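To make the open-loop pattern concrete, here is a minimal sketch in Python. The `DummyEnv`, `dummy_policy`, and `rollout` names are my own invention, not from any particular library; the point is only that the policy is queried once per chunk, while the $K$ actions in between are executed blindly.

```python
import numpy as np

K = 3  # chunk size (number of actions predicted per observation)

class DummyEnv:
    """Stand-in environment: counts steps and returns a fake 5-dim state."""
    def __init__(self):
        self.steps = 0
    def reset(self):
        self.steps = 0
        return np.zeros(5)
    def step(self, action):
        self.steps += 1
        return np.zeros(5)

def dummy_policy(obs):
    # A real policy would sample a chunk from the generative model.
    return np.zeros((K, 2))

def rollout(env, policy, horizon):
    """Open-loop chunking: query the policy once, then execute all K actions."""
    obs = env.reset()
    t, queries = 0, 0
    while t < horizon:
        chunk = policy(obs)           # (K, action_dim) = a_t ... a_{t+K-1}
        queries += 1
        for action in chunk:          # the observation is NOT refreshed here
            obs = env.step(action)
            t += 1
            if t >= horizon:
                break
    return queries

env = DummyEnv()
n_queries = rollout(env, dummy_policy, horizon=9)
# With horizon 9 and K = 3, the policy is queried only 3 times.
```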

As a concrete example, consider the Push-T task from 📄Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, where a robot agent pushes a T-shaped block into a target region.


The Push-T environment.

In this task, the observation relies on low-dimensional state vectors rather than high-dimensional images (i.e., we use a state $\mathbf{s}_t$ instead of a visual observation $\mathbf{o}_t$). The state is a 5-dimensional vector containing:

  • $x$-coordinate of the agent
  • $y$-coordinate of the agent
  • $x$-coordinate of the T-block
  • $y$-coordinate of the T-block
  • $\alpha$ (rotation angle) of the T-block

The action is a 2-vector $\begin{bmatrix}x\\y\end{bmatrix}$ dictating the agent’s target position.

Remark

For the distinction between $\mathbf{s}_t$ (fully observed) and $\mathbf{o}_t$ (partially observed), you can refer to Why the Gymnasium API Looks the Way It Does?.

Why Flow Matching?

The simple answer: generative models like flow matching and diffusion models are more expressive than a mixture of Gaussians.

Remark

Chelsea Finn refers to applying generative models in imitation learning as an “advanced version of imitation learning,” while Sergey Levine states that using generative models in imitation learning solves the issues of naive behavioral cloning. For a more in-depth discussion, I’ve written a separate post: Why Naive Behavioral Cloning Doesn’t Work?.

As an example, in the Push-T environment, a policy powered by a generative model can learn to approach the T-block from the left or from the right.


©️Chi, et al.

With a mixture of Gaussians, “the action to the left” averages with “the action to the right” and produces “go straight” — which is not what we want.😅

The Dataset: From Episodes to Training Samples

Expert demonstrations are recorded as episodes of varying lengths, depending on how long it took the expert to solve the task. To avoid jagged arrays during training, all episodes are concatenated into flat arrays. An episode_ends array marks the boundary indices of each episode.

For example, consider two short dummy episodes:

$$\begin{align} E^0 &= (\mathbf{s}_0, \mathbf{a}_0, \dots, \mathbf{s}_4, \mathbf{a}_4) \\ E^1 &= (\mathbf{s}_0, \mathbf{a}_0, \dots, \mathbf{s}_6, \mathbf{a}_6) \end{align}$$

We concatenate their actions and states into flat vectors:

$$\begin{align} S_{\text{states}} &= (\mathbf{s}_0, \mathbf{s}_1, \dots, \mathbf{s}_{10}, \mathbf{s}_{11}) \\ A_{\text{actions}} &= (\mathbf{a}_0, \mathbf{a}_1, \dots, \mathbf{a}_{10}, \mathbf{a}_{11}) \end{align}$$

The episode_ends array is [5, 12], meaning Episode 0 spans indices 0-4 and Episode 1 spans indices 5-11. With a chunk_size of $K=3$, a sliding window extracts (state, action_chunk) pairs. Each training sample pairs one state with the next $K$ consecutive actions.


💬Question: Given the episodes above and $K=3$, what are the valid starting indices? Why are indices 3 and 4 excluded for Episode 0?

🗣Answer: The valid starting indices are [0, 1, 2, 5, 6, 7, 8, 9]. Indices 3 and 4 are excluded because they don’t have $K=3$ actions left before the episode ends. The same logic applies to Episode 1, where indices 10 and 11 are excluded.

Episode 0:
[a0] [a1] [a2] [a3] [a4]
 ✓    ✓    ✓    ✗    ✗
 0    1    2    3    4
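The index bookkeeping above fits in a few lines of Python. This is a sketch; `valid_start_indices` is a hypothetical helper, and `episode_ends` uses the exclusive-end convention described above.

```python
episode_ends = [5, 12]  # exclusive end index of each episode in the flat arrays
K = 3                   # chunk size

def valid_start_indices(episode_ends, K):
    """A start index is valid if K consecutive actions fit inside its episode."""
    starts, begin = [], 0
    for end in episode_ends:
        starts.extend(range(begin, end - K + 1))  # last valid start: end - K
        begin = end
    return starts

valid_start_indices(episode_ends, K)
# → [0, 1, 2, 5, 6, 7, 8, 9]
```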

💬Question: During training, the dataloader shuffles and mixes samples from different episodes within a batch. Why is this acceptable?

🗣Answer: Because the underlying task remains constant. Every (state, action_chunk) pair is a self-contained snapshot of an intermediate step toward the same objective. Drawing randomized, uncorrelated transitions across different demonstrations also decorrelates the samples within a batch, which reduces gradient variance.

Training: What the Network Sees

In Flow Matching, we define a continuous flow from a simple noise distribution to our complex data distribution. Let’s establish our notation. Let the expert action chunk be our target data $\mathbf{A}_1$, and let $\mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I})$ be a sample of pure noise.

We introduce a flow-time variable $\tau \in [0, 1)$.

Remark

We use $\tau$ to represent the integration time of the flow, keeping it distinct from $t$, which represents the environment timestep.

During a training step, the network $v_\theta$ receives the current state $\mathbf{s}_t$, a specific flow-time $\tau$, and an interpolated action chunk $\mathbf{A}_\tau$. We construct this straight-line interpolation as:

$$\mathbf{A}_\tau = \tau \mathbf{A}_1 + (1 - \tau) \mathbf{A}_0$$

The network’s objective is to predict the velocity (the derivative with respect to $\tau$). For a straight-line optimal transport path, the exact target velocity is simply:

$$\begin{align} \mathbf{A}_\tau &= \tau \mathbf{A}_1 + (1 - \tau) \mathbf{A}_0 \\ \frac{d}{d\tau} \mathbf{A}_\tau &= \frac{d}{d\tau}\left((1-\tau)\mathbf{A}_0 + \tau \mathbf{A}_1\right) \\ \frac{d}{d\tau} \mathbf{A}_\tau &= \mathbf{A}_1 - \mathbf{A}_0 \end{align}$$

Remark

This holds because, during a training step, $\mathbf{A}_1$ (the expert action chunk) and $\mathbf{A}_0$ (the initial noise chunk) are both constants.

Tip

Crucially, the network never sees the raw expert chunk $\mathbf{A}_1$ as an input. It only sees the noisy intermediate state $\mathbf{A}_\tau$ and learns the direction to push it toward reality. This is why the network is denoted $v_{\theta}$ (a velocity field) rather than $\pi_{\theta}$.


For example, if the chunk size is $K=3$, then the dimension of the network input would be

state(5) + interpolated_chunk(2 * 3) + tau(1) = 12

and the output would be a velocity vector of dimension

action_dim(2) * chunk_size(3) = 6
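As a sanity check on those dimensions, here is a small NumPy sketch (the helper name `make_network_input` is my own):

```python
import numpy as np

state_dim, action_dim, K = 5, 2, 3

def make_network_input(state, chunk_tau, tau):
    """Concatenate state, flattened interpolated chunk, and flow-time tau."""
    return np.concatenate([state, chunk_tau.reshape(-1), [tau]])

state = np.zeros(state_dim)            # (5,)  Push-T state
chunk_tau = np.zeros((K, action_dim))  # (3, 2) interpolated action chunk
x = make_network_input(state, chunk_tau, tau=0.5)
x.shape  # → (12,); the network maps this to a (6,) velocity vector
```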

💬Question: The target velocity ($\mathbf{A}_1 - \mathbf{A}_0$) does not depend on $\tau$. Why does $\tau$ still matter as an input to the network? What would go wrong if we always set $\tau = 0$ during training?

🗣Answer:

The target velocity is constant, but the location $\mathbf{A}_\tau$ in the vector field is constantly changing. We are training the network to predict the correct velocity from any coordinate along the path.

If we only trained on $\tau = 0$, the network would only learn to predict velocities when looking at pure noise. During inference, after taking the first integration step, the data becomes partially denoised ($\tau > 0$). The network would have no idea how to handle this new, structured input, and the integration would immediately collapse.


💬Question: At $\tau = 0.0$, the network input is pure noise. At $\tau \approx 1.0$, it is almost the exact expert data. At which extreme is the prediction task hardest?

🗣Answer: It is hardest at $\tau = 0.0$. The network is looking at complete static and has to guess the exact trajectory toward a highly specific, structured action chunk.

Inference: Euler Integration

At inference time, we do not have access to expert data. We start by sampling pure noise $\mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I})$ and use the learned velocity field $v_\theta$ to integrate forward using the Euler method.

$$\mathbf{A}_{\tau + \Delta\tau} = \mathbf{A}_\tau + \Delta\tau \cdot v_\theta(\mathbf{s}_t, \mathbf{A}_\tau, \tau)$$

If we choose num_steps = 4, our step size is 1 / 4 = 0.25, i.e., $\Delta\tau = 0.25$. The integration process looks like this:

| Step | Current $\tau$ | Velocity Input $\mathbf{A}_\tau$ | Next chunk $\mathbf{A}_{\tau + \Delta\tau}$ |
| --- | --- | --- | --- |
| 0 | 0.00 | Pure noise | Chunk at $\tau = 0.25$ |
| 1 | 0.25 | Partially denoised | Chunk at $\tau = 0.50$ |
| 2 | 0.50 | Mostly structured | Chunk at $\tau = 0.75$ |
| 3 | 0.75 | Highly structured | Final action chunk ($\tau = 1.00$) |
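Here is a runnable sketch of the Euler loop. In place of a trained network, it uses a hypothetical "oracle" velocity field of my own construction: on a straight-line path, the remaining displacement divided by the remaining flow-time is exactly the correct velocity, so four Euler steps land precisely on the target chunk.

```python
import numpy as np

K, action_dim = 3, 2

def sample_chunk(v_theta, state, num_steps=4, rng=np.random.default_rng(0)):
    """Integrate the learned velocity field from tau=0 to tau=1 with Euler steps."""
    d_tau = 1.0 / num_steps
    A = rng.standard_normal((K, action_dim))  # A_0 ~ N(0, I)
    tau = 0.0
    for _ in range(num_steps):                # tau = 0, 0.25, 0.5, 0.75
        A = A + d_tau * v_theta(state, A, tau)
        tau += d_tau
    return A                                  # chunk at tau = 1.0

# Hypothetical oracle field: always points straight at a fixed target chunk.
target = np.ones((K, action_dim))
def oracle_v(state, A, tau):
    # Remaining displacement spread over the remaining flow-time (1 - tau).
    return (target - A) / (1.0 - tau)

chunk = sample_chunk(oracle_v, state=np.zeros(5))
# chunk now equals target exactly, since the last step lands on tau = 1.0.
```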

💬 Question: Why is the network never evaluated at $\tau = 1.0$ during inference?

🗣Answer: Because the Euler step taken at $\tau = 0.75$ pushes the chunk exactly to the boundary of $\tau = 1.00$. Once we arrive at the destination, the flow is complete, and we extract the actions to execute.

Tip

This mirrors how we train the model: we sample $\tau$ from a uniform distribution over $[0, 1)$. We don’t train on $\tau = 1.0$ because the velocity field doesn’t need to push the data anywhere once it has arrived.


💬 Question: What happens if you set num_steps = 1? Under what condition would a single step produce a perfect sample?

🗣 Answer: Setting num_steps = 1 means taking one massive Euler step:

$$\mathbf{A}_{1.0} = \mathbf{A}_0 + 1.0 \cdot v_\theta(\mathbf{s}_t, \mathbf{A}_0, 0.0).$$

This would only produce a perfect sample if the network learned the exact true velocity ($\mathbf{A}_1 - \mathbf{A}_0$) flawlessly. Because Flow Matching paths are straight lines by construction, a perfect velocity prediction means one step is theoretically sufficient. In practice, predictions at $\tau=0$ are the noisiest, so breaking the flow into smaller steps allows the network to correct its course as the chunk becomes more structured.

Reflections

Why operate on the full chunk?

Flow matching doesn’t predict a single action. It generates an entire chunk of future actions at once. The noise, interpolation, and velocity all operate on the full flattened chunk vector.
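In code terms, "operating on the full chunk" just means the noise, the interpolation, and the target velocity all live in the flattened $K \times$ action_dim space. A NumPy sketch with made-up values:

```python
import numpy as np

K, action_dim = 3, 2
rng = np.random.default_rng(0)

A1 = rng.standard_normal(K * action_dim)  # expert chunk, flattened to (6,)
A0 = rng.standard_normal(K * action_dim)  # ONE noise draw for the whole chunk
tau = 0.3

A_tau = tau * A1 + (1 - tau) * A0  # interpolates all K actions jointly
v_target = A1 - A0                 # a single (6,) velocity over the chunk
```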


💬Question: What advantage does this give over predicting each action in the chunk independently?

🗣Answer: It allows the model to learn temporal correlations across time steps. If action $\mathbf{a}_t$ initiates a rightward push, action $\mathbf{a}_{t+1}$ must logically follow through. Operating on the full chunk allows the velocity field to enforce physical consistency and smooth trajectories. Predicting actions independently would destroy this temporal coherence, leading to jerky, contradictory movements.

Multimodality

💬 Question: Imagine two expert demonstrations show “push left” and “push right” for the exact same state. How does Flow Matching mechanically produce both strategies? Where does the “choice” come from?

🗣 Answer: The choice comes entirely from the initial noise $\mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I})$ sampled at the start of inference.

If noise sample $A$ lands on one side of the latent noise space, the learned velocity field sweeps it toward the “push left” action chunk. If noise sample $B$ is drawn, it might land in a region that flows toward the “push right” chunk. The state $\mathbf{s}_t$ is identical in both cases; the random starting point dictates the final mode. This is why generative models easily handle multimodality, whereas a standard MSE policy would simply average the two demonstrations, resulting in a useless “push straight” command.

How does the model learn the full velocity field?

During training, each dataset sample is paired with one random $\tau$ and one random noise vector per step. The model never sees the same sample tracked across all $\tau$ values in a single pass.


💬 Question: How does the model eventually learn the velocity field across the entire $\tau \in [0, 1)$ range?

🗣 Answer: The coverage of the flow-time $\tau$ is achieved over the course of multiple epochs. The training loop behaves like this:

$$\begin{align*} &\text{For each epoch}: \\ &\quad \text{For each } (\mathbf{s}_t, \mathbf{A}_1) \text{ in shuffled dataset}: \\ &\quad\quad \tau \sim \text{Uniform}[0, 1) \\ &\quad\quad \mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I}) \\ &\quad\quad \mathbf{A}_\tau = \tau \mathbf{A}_1 + (1 - \tau) \mathbf{A}_0 \\ &\quad\quad v_{\text{target}} = \mathbf{A}_1 - \mathbf{A}_0 \\ &\quad\quad \hat{v} = v_\theta(\mathbf{s}_t, \mathbf{A}_\tau, \tau) \\ &\quad\quad \mathcal{L} = \| \hat{v} - v_{\text{target}} \|^2 \\ &\quad\quad \text{Update } \theta \text{ using } \nabla_\theta \mathcal{L} \end{align*}$$

Because the dataset is iterated over hundreds of times, a specific state $\mathbf{s}_t$ will eventually be evaluated against many different values of $\tau$ and many different noise vectors $\mathbf{A}_0$. Over time, the network pieces together the complete continuous vector field.
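The pseudocode can be turned into a self-contained NumPy sketch. To stay runnable without a deep-learning framework, I stand in a plain linear map for $v_\theta$ and random arrays for the dataset; these stand-ins are my own, but the loop itself mirrors the pseudocode line by line.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, K, action_dim = 5, 3, 2
in_dim = state_dim + K * action_dim + 1   # 12: state + flat chunk + tau
out_dim = K * action_dim                  # 6: one velocity per action coord

# Toy stand-in dataset of (state, expert chunk) pairs.
states = rng.standard_normal((8, state_dim))
chunks = rng.standard_normal((8, out_dim))

W, b = np.zeros((out_dim, in_dim)), np.zeros(out_dim)  # linear v_theta
lr, losses = 1e-2, []

for epoch in range(100):
    epoch_loss = 0.0
    for s, A1 in zip(states, chunks):      # (shuffling elided for brevity)
        tau = rng.uniform(0.0, 1.0)        # fresh flow-time every visit
        A0 = rng.standard_normal(out_dim)  # fresh noise every visit
        A_tau = tau * A1 + (1 - tau) * A0
        x = np.concatenate([s, A_tau, [tau]])
        v_hat = W @ x + b
        err = v_hat - (A1 - A0)            # d(0.5 * ||err||^2) / d(v_hat)
        W -= lr * np.outer(err, x)         # manual SGD step
        b -= lr * err
        epoch_loss += np.mean(err ** 2)
    losses.append(epoch_loss / len(states))
```

Even this linear stand-in drives the loss down across epochs, because each pass pairs the same $(\mathbf{s}_t, \mathbf{A}_1)$ with new $\tau$ and $\mathbf{A}_0$ draws, gradually covering the field.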

Cross-episode shuffling

💬Question: The dataloader shuffles the dataset every epoch. Does “shuffle” mean rearranging chunks within a single episode, or across all episodes?

🗣Answer: It shuffles across all episodes since each pair is self-contained. A single batch freely mixes samples from different demonstrations:

Batch: indices[7], indices[1], indices[5]
     =       9,          1,          7
     → (s9, [a9,a10,a11]),  (s1, [a1,a2,a3]),  (s7, [a7,a8,a9])
       ^^^ from ep1          ^^^ from ep0       ^^^ from ep1
