Xingxin on Bug

Understanding Action Chunking with Flow Matching

February 22, 2026
12 min read

As part of Homework 1 (Imitation Learning) for CS 185/285: Deep Reinforcement Learning, and my research in robot learning, I needed to get up to speed quickly on the basics of Flow Matching. In my previous blog post, A Beginner’s Guide to Flow Matching, I explained the core intuition behind the topic.

This post dives deep into applying flow matching to action chunking, specifically geared toward those wanting to understand the mechanics behind modern robot imitation learning. To respect academic integrity, I won’t release my source code. Instead, I’ll explain the architecture and the math from a big-picture perspective.

What is Action Chunking?

Let’s first review the concept of action chunking, popularized by research like 📄Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT). In standard imitation learning, the policy is defined as:

$$\mathbf{a}_t \sim \pi_{\theta}(\cdot \mid \mathbf{o}_t).$$

The policy $\pi_{\theta}$ represents a probability distribution over a single action $\mathbf{a}_t$ conditioned on the current observation $\mathbf{o}_t$. Instead of predicting just one action $\mathbf{a}_t$, action chunking proposes predicting a sequence, or chunk, of future actions all at once:

$$\mathbf{A}_t \sim \pi_{\theta}(\cdot \mid \mathbf{o}_t),$$

where $\mathbf{A}_t=\{\mathbf{a}_t, \mathbf{a}_{t+1},\mathbf{a}_{t+2},\dots,\mathbf{a}_{t+K-1}\}$ is a fixed $K$-length sequence of actions.

Remark

This creates an open-loop execution phase: the environment receives the action $\mathbf{a}_t$ at time $t$, then $\mathbf{a}_{t+1}$ at time $t+1$, and so on through $\mathbf{a}_{t+K-1}$. The agent only queries a new observation at time $t+K$.
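To make the open-loop pattern concrete, here is a minimal sketch in Python. The `DummyEnv`, `dummy_policy`, and `rollout` names are my own invention, not from any particular library; the point is only that the policy is queried once per chunk, while the $K$ actions in between are executed blindly.

```python
import numpy as np

K = 3  # chunk size (number of actions predicted per observation)

class DummyEnv:
    """Stand-in environment: counts steps and returns a fake 5-dim state."""
    def __init__(self):
        self.steps = 0
    def reset(self):
        self.steps = 0
        return np.zeros(5)
    def step(self, action):
        self.steps += 1
        return np.zeros(5)

def dummy_policy(obs):
    # A real policy would sample a chunk from the generative model.
    return np.zeros((K, 2))

def rollout(env, policy, horizon):
    """Open-loop chunking: query the policy once, then execute all K actions."""
    obs = env.reset()
    t, queries = 0, 0
    while t < horizon:
        chunk = policy(obs)           # (K, action_dim) = a_t ... a_{t+K-1}
        queries += 1
        for action in chunk:          # the observation is NOT refreshed here
            obs = env.step(action)
            t += 1
            if t >= horizon:
                break
    return queries

env = DummyEnv()
n_queries = rollout(env, dummy_policy, horizon=9)
# With horizon 9 and K = 3, the policy is queried only 3 times.
```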

As a concrete example, consider the Push-T task from 📄Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, where a robot agent pushes a T-shaped block into a target region.


The Push-T environment.

In this task, the observation relies on low-dimensional state vectors rather than high-dimensional images (i.e., we use a state $\mathbf{s}_t$ instead of a visual observation $\mathbf{o}_t$). The state is a 5-dimensional vector containing:

  • $x$-coordinate of the agent
  • $y$-coordinate of the agent
  • $x$-coordinate of the T-block
  • $y$-coordinate of the T-block
  • $\alpha$ (rotation angle) of the T-block

The action is a 2-vector $\begin{bmatrix}x\\y\end{bmatrix}$ dictating the agent’s target position.

Remark

For the distinction between $\mathbf{s}_t$ (fully observed) and $\mathbf{o}_t$ (partially observed), you can refer to Why the Gymnasium API Looks the Way It Does?.

Why Flow Matching?

The simple answer: generative models like flow matching and diffusion models are more expressive than a mixture of Gaussians.

Remark

Chelsea Finn refers to applying generative models in imitation learning as an “advanced version of imitation learning,” while Sergey Levine states that using generative models in imitation learning solves the issues of naive behavioral cloning. For a more in-depth discussion, I’ve written a separate post: Why Naive Behavioral Cloning Doesn’t Work?.

As an example, in the Push-T environment, a policy powered by a generative model can learn to approach the T-block from the left or from the right.


©️Chi, et al.

With a mixture of Gaussians, “the action to the left” averages with “the action to the right” and produces “go straight” — which is not what we want.😅

The Dataset: From Episodes to Training Samples

Expert demonstrations are recorded as episodes of varying lengths, depending on how long it took the expert to solve the task. To avoid jagged arrays during training, all episodes are concatenated into flat arrays. An episode_ends array marks the boundary indices of each episode.

For example, consider two short dummy episodes:

$$\begin{align} E^0 &= (\mathbf{s}_0, \mathbf{a}_0, \dots, \mathbf{s}_4, \mathbf{a}_4) \\ E^1 &= (\mathbf{s}_0, \mathbf{a}_0, \dots, \mathbf{s}_6, \mathbf{a}_6) \end{align}$$

We concatenate their actions and states into flat vectors:

$$\begin{align} S_{\text{states}} &= (\mathbf{s}_0, \mathbf{s}_1, \dots, \mathbf{s}_{10}, \mathbf{s}_{11}) \\ A_{\text{actions}} &= (\mathbf{a}_0, \mathbf{a}_1, \dots, \mathbf{a}_{10}, \mathbf{a}_{11}) \end{align}$$

The episode_ends array is [5, 12], meaning Episode 0 spans indices 0-4 and Episode 1 spans indices 5-11. With a chunk_size of $K=3$, a sliding window extracts (state, action_chunk) pairs. Each training sample pairs one state with the next $K$ consecutive actions.


💬Question: Given the episodes above and $K=3$, what are the valid starting indices? Why are indices 3 and 4 excluded for Episode 0?

🗣Answer: The valid starting indices are [0, 1, 2, 5, 6, 7, 8, 9]. Indices 3 and 4 are excluded because they don’t have $K=3$ actions left before the episode ends. The same logic applies to Episode 1, where indices 10 and 11 are excluded.

Episode 0:
[a0] [a1] [a2] [a3] [a4]
 ✓    ✓    ✓    ✗    ✗
 0    1    2    3    4
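The index bookkeeping above fits in a few lines of Python. This is a sketch; `valid_start_indices` is a hypothetical helper, and `episode_ends` uses the exclusive-end convention described above.

```python
episode_ends = [5, 12]  # exclusive end index of each episode in the flat arrays
K = 3                   # chunk size

def valid_start_indices(episode_ends, K):
    """A start index is valid if K consecutive actions fit inside its episode."""
    starts, begin = [], 0
    for end in episode_ends:
        starts.extend(range(begin, end - K + 1))  # last valid start: end - K
        begin = end
    return starts

valid_start_indices(episode_ends, K)
# → [0, 1, 2, 5, 6, 7, 8, 9]
```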

💬Question: During training, the dataloader shuffles and mixes samples from different episodes within a batch. Why is this acceptable?

🗣Answer: Because the underlying task remains constant. Every (state, action_chunk) pair is a self-contained snapshot of an intermediate step toward the same objective. Drawing randomized, uncorrelated transitions across different demonstrations also decorrelates the samples within a batch, which reduces gradient variance.

Training: What the Network Sees

In Flow Matching, we define a continuous flow from a simple noise distribution to our complex data distribution. Let’s establish our notation. Let the expert action chunk be our target data $\mathbf{A}_1$, and let $\mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I})$ be a sample of pure noise.

We introduce a flow-time variable $\tau \in [0, 1)$.

Remark

We use $\tau$ to represent the integration time of the flow, keeping it distinct from $t$, which represents the environment timestep.

During a training step, the network $v_\theta$ receives the current state $\mathbf{s}_t$, a specific flow-time $\tau$, and an interpolated action chunk $\mathbf{A}_\tau$. We construct this straight-line interpolation as:

$$\mathbf{A}_\tau = \tau \mathbf{A}_1 + (1 - \tau) \mathbf{A}_0$$

The network’s objective is to predict the velocity (the derivative with respect to $\tau$). For a straight-line optimal transport path, the exact target velocity is simply:

$$\begin{align} \mathbf{A}_\tau &= \tau \mathbf{A}_1 + (1 - \tau) \mathbf{A}_0 \\ \frac{d}{d\tau} \mathbf{A}_\tau &= \frac{d}{d\tau}\left((1-\tau)\mathbf{A}_0 + \tau \mathbf{A}_1\right) \\ \frac{d}{d\tau} \mathbf{A}_\tau &= \mathbf{A}_1 - \mathbf{A}_0 \end{align}$$

Remark

This holds because, during a training step, $\mathbf{A}_1$ (the expert action chunk) and $\mathbf{A}_0$ (the initial noise chunk) are both constants.

Tip

Crucially, the network never sees the raw expert chunk $\mathbf{A}_1$ as an input. It only sees the noisy intermediate state $\mathbf{A}_\tau$ and learns the direction to push it toward reality. This is why the network is denoted $v_{\theta}$ (a velocity field) rather than $\pi_{\theta}$.


For example, if the chunk size is $K=3$, then the dimension of the network input would be

state(5) + interpolated_chunk(2 * 3) + tau(1) = 12

and the output would be a velocity vector of dimension

action_dim(2) * chunk_size(3) = 6
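As a sanity check on those dimensions, here is a small NumPy sketch (the helper name `make_network_input` is my own):

```python
import numpy as np

state_dim, action_dim, K = 5, 2, 3

def make_network_input(state, chunk_tau, tau):
    """Concatenate state, flattened interpolated chunk, and flow-time tau."""
    return np.concatenate([state, chunk_tau.reshape(-1), [tau]])

state = np.zeros(state_dim)            # (5,)  Push-T state
chunk_tau = np.zeros((K, action_dim))  # (3, 2) interpolated action chunk
x = make_network_input(state, chunk_tau, tau=0.5)
x.shape  # → (12,); the network maps this to a (6,) velocity vector
```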

💬Question: The target velocity ($\mathbf{A}_1 - \mathbf{A}_0$) does not depend on $\tau$. Why does $\tau$ still matter as an input to the network? What would go wrong if we always set $\tau = 0$ during training?

🗣Answer:

The target velocity is constant, but the location $\mathbf{A}_\tau$ in the vector field is constantly changing. We are training the network to predict the correct velocity from any coordinate along the path.

If we only trained on $\tau = 0$, the network would only learn to predict velocities when looking at pure noise. During inference, after taking the first integration step, the data becomes partially denoised ($\tau > 0$). The network would have no idea how to handle this new, structured input, and the integration would immediately collapse.


💬Question: At $\tau = 0.0$, the network input is pure noise. At $\tau \approx 1.0$, it is almost the exact expert data. At which extreme is the prediction task hardest?

🗣Answer: It is hardest at $\tau = 0.0$. The network is looking at complete static and has to guess the exact trajectory toward a highly specific, structured action chunk.

Inference: Euler Integration

At inference time, we do not have access to expert data. We start by sampling pure noise $\mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I})$ and use the learned velocity field $v_\theta$ to integrate forward using the Euler method.

$$\mathbf{A}_{\tau + \Delta\tau} = \mathbf{A}_\tau + \Delta\tau \cdot v_\theta(\mathbf{s}_t, \mathbf{A}_\tau, \tau)$$

If we choose num_steps = 4, our step size is 1 / 4 = 0.25, i.e., $\Delta\tau = 0.25$. The integration process looks like this:

| Step | Current $\tau$ | Velocity Input $\mathbf{A}_\tau$ | Next chunk $\mathbf{A}_{\tau + \Delta\tau}$ |
| --- | --- | --- | --- |
| 0 | 0.00 | Pure noise | Chunk at $\tau = 0.25$ |
| 1 | 0.25 | Partially denoised | Chunk at $\tau = 0.50$ |
| 2 | 0.50 | Mostly structured | Chunk at $\tau = 0.75$ |
| 3 | 0.75 | Highly structured | Final action chunk ($\tau = 1.00$) |
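Here is a runnable sketch of the Euler loop. In place of a trained network, it uses a hypothetical "oracle" velocity field of my own construction: on a straight-line path, the remaining displacement divided by the remaining flow-time is exactly the correct velocity, so four Euler steps land precisely on the target chunk.

```python
import numpy as np

K, action_dim = 3, 2

def sample_chunk(v_theta, state, num_steps=4, rng=np.random.default_rng(0)):
    """Integrate the learned velocity field from tau=0 to tau=1 with Euler steps."""
    d_tau = 1.0 / num_steps
    A = rng.standard_normal((K, action_dim))  # A_0 ~ N(0, I)
    tau = 0.0
    for _ in range(num_steps):                # tau = 0, 0.25, 0.5, 0.75
        A = A + d_tau * v_theta(state, A, tau)
        tau += d_tau
    return A                                  # chunk at tau = 1.0

# Hypothetical oracle field: always points straight at a fixed target chunk.
target = np.ones((K, action_dim))
def oracle_v(state, A, tau):
    # Remaining displacement spread over the remaining flow-time (1 - tau).
    return (target - A) / (1.0 - tau)

chunk = sample_chunk(oracle_v, state=np.zeros(5))
# chunk now equals target exactly, since the last step lands on tau = 1.0.
```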

💬 Question: Why is the network never evaluated at $\tau = 1.0$ during inference?

🗣Answer: Because the Euler step taken at $\tau = 0.75$ pushes the chunk exactly to the boundary of $\tau = 1.00$. Once we arrive at the destination, the flow is complete, and we extract the actions to execute.

Tip

This mirrors how we train the model: we sample $\tau$ from a uniform distribution over $[0, 1)$. We don’t train on $\tau = 1.0$ because the velocity field doesn’t need to push the data anywhere once it has arrived.


💬 Question: What happens if you set num_steps = 1? Under what condition would a single step produce a perfect sample?

🗣 Answer: Setting num_steps = 1 means taking one massive Euler step:

$$\mathbf{A}_{1.0} = \mathbf{A}_0 + 1.0 \cdot v_\theta(\mathbf{s}_t, \mathbf{A}_0, 0.0).$$

This would only produce a perfect sample if the network learned the exact true velocity ($\mathbf{A}_1 - \mathbf{A}_0$) flawlessly. Because Flow Matching paths are straight lines by construction, a perfect velocity prediction means one step is theoretically sufficient. In practice, predictions at $\tau=0$ are the noisiest, so breaking the flow into smaller steps allows the network to correct its course as the chunk becomes more structured.

Reflections

Why operate on the full chunk?

Flow matching doesn’t predict a single action. It generates an entire chunk of future actions at once. The noise, interpolation, and velocity all operate on the full flattened chunk vector.
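In code terms, "operating on the full chunk" just means the noise, the interpolation, and the target velocity all live in the flattened $K \times$ action_dim space. A NumPy sketch with made-up values:

```python
import numpy as np

K, action_dim = 3, 2
rng = np.random.default_rng(0)

A1 = rng.standard_normal(K * action_dim)  # expert chunk, flattened to (6,)
A0 = rng.standard_normal(K * action_dim)  # ONE noise draw for the whole chunk
tau = 0.3

A_tau = tau * A1 + (1 - tau) * A0  # interpolates all K actions jointly
v_target = A1 - A0                 # a single (6,) velocity over the chunk
```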


💬Question: What advantage does this give over predicting each action in the chunk independently?

🗣Answer: It allows the model to learn temporal correlations across time steps. If action $\mathbf{a}_t$ initiates a rightward push, action $\mathbf{a}_{t+1}$ must logically follow through. Operating on the full chunk allows the velocity field to enforce physical consistency and smooth trajectories. Predicting actions independently would destroy this temporal coherence, leading to jerky, contradictory movements.

Multimodality

💬 Question: Imagine two expert demonstrations show “push left” and “push right” for the exact same state. How does Flow Matching mechanically produce both strategies? Where does the “choice” come from?

🗣 Answer: The choice comes entirely from the initial noise $\mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I})$ sampled at the start of inference.

If noise sample $A$ lands on one side of the latent noise space, the learned velocity field sweeps it toward the “push left” action chunk. If noise sample $B$ is drawn, it might land in a region that flows toward the “push right” chunk. The state $\mathbf{s}_t$ is identical in both cases; the random starting point dictates the final mode. This is why generative models easily handle multimodality, whereas a standard MSE policy would simply average the two demonstrations, resulting in a useless “push straight” command.

How does the model learn the full velocity field?

During training, each dataset sample is paired with one random $\tau$ and one random noise vector per step. The model never sees the same sample tracked across all $\tau$ values in a single pass.


💬 Question: How does the model eventually learn the velocity field across the entire $\tau \in [0, 1)$ range?

🗣 Answer: The coverage of the flow-time $\tau$ is achieved over the course of multiple epochs. The training loop behaves like this:

$$\begin{align*} &\text{For each epoch}: \\ &\quad \text{For each } (\mathbf{s}_t, \mathbf{A}_1) \text{ in shuffled dataset}: \\ &\quad\quad \tau \sim \text{Uniform}[0, 1) \\ &\quad\quad \mathbf{A}_0 \sim \mathcal{N}(0, \mathbf{I}) \\ &\quad\quad \mathbf{A}_\tau = \tau \mathbf{A}_1 + (1 - \tau) \mathbf{A}_0 \\ &\quad\quad v_{\text{target}} = \mathbf{A}_1 - \mathbf{A}_0 \\ &\quad\quad \hat{v} = v_\theta(\mathbf{s}_t, \mathbf{A}_\tau, \tau) \\ &\quad\quad \mathcal{L} = \| \hat{v} - v_{\text{target}} \|^2 \\ &\quad\quad \text{Update } \theta \text{ using } \nabla_\theta \mathcal{L} \end{align*}$$

Because the dataset is iterated over hundreds of times, a specific state $\mathbf{s}_t$ will eventually be evaluated against many different values of $\tau$ and many different noise vectors $\mathbf{A}_0$. Over time, the network pieces together the complete continuous vector field.
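The pseudocode can be turned into a self-contained NumPy sketch. To stay runnable without a deep-learning framework, I stand in a plain linear map for $v_\theta$ and random arrays for the dataset; these stand-ins are my own, but the loop itself mirrors the pseudocode line by line.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, K, action_dim = 5, 3, 2
in_dim = state_dim + K * action_dim + 1   # 12: state + flat chunk + tau
out_dim = K * action_dim                  # 6: one velocity per action coord

# Toy stand-in dataset of (state, expert chunk) pairs.
states = rng.standard_normal((8, state_dim))
chunks = rng.standard_normal((8, out_dim))

W, b = np.zeros((out_dim, in_dim)), np.zeros(out_dim)  # linear v_theta
lr, losses = 1e-2, []

for epoch in range(100):
    epoch_loss = 0.0
    for s, A1 in zip(states, chunks):      # (shuffling elided for brevity)
        tau = rng.uniform(0.0, 1.0)        # fresh flow-time every visit
        A0 = rng.standard_normal(out_dim)  # fresh noise every visit
        A_tau = tau * A1 + (1 - tau) * A0
        x = np.concatenate([s, A_tau, [tau]])
        v_hat = W @ x + b
        err = v_hat - (A1 - A0)            # d(0.5 * ||err||^2) / d(v_hat)
        W -= lr * np.outer(err, x)         # manual SGD step
        b -= lr * err
        epoch_loss += np.mean(err ** 2)
    losses.append(epoch_loss / len(states))
```

Even this linear stand-in drives the loss down across epochs, because each pass pairs the same $(\mathbf{s}_t, \mathbf{A}_1)$ with new $\tau$ and $\mathbf{A}_0$ draws, gradually covering the field.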

Cross-episode shuffling

💬Question: The dataloader shuffles the dataset every epoch. Does “shuffle” mean rearranging chunks within a single episode, or across all episodes?

🗣Answer: It shuffles across all episodes since each pair is self-contained. A single batch freely mixes samples from different demonstrations:

Batch: indices[7], indices[1], indices[5]
     =       9,          1,          7
     → (s9, [a9,a10,a11]),  (s1, [a1,a2,a3]),  (s7, [a7,a8,a9])
       ^^^ from ep1          ^^^ from ep0       ^^^ from ep1
