
A Beginner's Guide to Flow Matching

February 21, 2026
7 min read

As part of my Homework 1: Imitation Learning for CS 185/285: Deep Reinforcement Learning and my research in robot learning, I need to quickly get up to speed on the basics of Flow Matching. This blog post documents my intuitive understanding of the topic.

Ordinary Differential Equations (ODEs)

📌 What is an ODE?

To understand flow matching, we first need to review ordinary differential equations. In this context, an ODE defines a flow and looks like the following:

$$\frac{dX_t}{dt} = u_t(X_t)$$

where

  • $t$ denotes time
  • $X_t$ refers to the point $X$ of our data at time $t$
  • $u_t(X_t)$ is the vector field (or velocity field) that prescribes the instantaneous movement, or velocity, of $X_t$ at any given location and time.

📌 The Euler Method

To compute where a starting point ends up over time, we can approximate the solution to the ODE using the Euler method. We step forward with a small time step $h$ and update our position based on the velocity the vector field prescribes at the current position. The position at the next time step $t+h$ is calculated as:

$$X_{t+h} = u_t(X_t) \cdot h + X_t.$$

Let's look at an example. Suppose we have:

  • vector field: $u_t(x) = -x$
  • starting position: $X_0 = 4$
  • step size: $h = 0.5$

We want to find our final position at $t = 1.0$. Here is the state table for our simulation:

| Step | Time $t$ | Position $X_t$ | Velocity $u_t(X_t)$ | Next Position $X_{t+h}$ |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 4.0 | -4.0 | 2.0 |
| 1 | 0.5 | 2.0 | -2.0 | 1.0 |
ODE with Neural Networks

Instead of explicitly hardcoding a simple vector field like $u_t(x) = -x$, we want a neural network to learn the vector field for us. We denote this learnable vector field as:

$$u_t^{\theta}(X_t).$$

The goal of this network with parameters $\theta$ is to learn a vector field that can seamlessly transform initial random noise $X_0$ into a complex data destination $X_1$ (such as an image) at time $t = 1$.

Let the starting noise be $x_0$ and the target real data be $x_1$. To keep things simple, let's define a point-to-point straight-line path from $x_0$ to $x_1$. The position along this path at any time $t$ will be:

$$X_t = (1-t)x_0 + t x_1.$$
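As a quick sanity check, this interpolation is one line of Python (a toy sketch with made-up numbers):

```python
def interpolate(x0, x1, t):
    """Position X_t on the straight-line path from x0 to x1 at time t in [0, 1]."""
    return (1 - t) * x0 + t * x1

# At t = 0 we sit at the noise, at t = 1 at the data,
# and at t = 0.5 exactly halfway in between.
print(interpolate(1.0, 5.0, 0.0))  # 1.0
print(interpolate(1.0, 5.0, 0.5))  # 3.0
print(interpolate(1.0, 5.0, 1.0))  # 5.0
```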

Matching The Velocity

To train our network to recreate this flow, we need its output to match the velocity of this exact path. The target velocity (the derivative $\frac{d}{dt}X_t$) for this specific straight-line path is:

$$\begin{align} X_t &= (1-t)x_0 + t x_1 \\ \frac{d}{dt} X_t &= \frac{d}{dt}\left((1-t)x_0 + t x_1\right) \\ \frac{d}{dt} X_t &= x_1 - x_0 \end{align}$$

Remark

Just a quick math explanation: we treat the starting noise $x_0$ and the final real data $x_1$ as constants. The derivative of a constant is simply $0$, i.e., $\frac{d}{dx}(c) = 0$, leaving us with just $x_1 - x_0$ as the velocity.

Flow matching is simply the process of training a network $u_t^{\theta}(X_t)$ to output this exact velocity so it can push initial data $X_0$ toward the target $X_1$. The movement of the data $X_t$ resembles a flow, hence the name.

The Euler Method for Inference

Suppose we have successfully trained a "perfect" neural network that learned a constant vector field $u_t^{\theta}(X_t) = 4$. We can use the Euler method to step through time and update our position once every time step $h$.

For example, given our setup:

  • Starting position: $X_0 = 1$
  • Step size $h$: 0.5
  • Predicted velocity $u_t(X_t)$: 4

What is our new position, $X_{0.5}$, after taking this first step?

The answer is

$$\begin{align} X_{t+h} &= u_t(X_t) \cdot h + X_t \\ X_{0.5} &= 4 \cdot 0.5 + X_0 \\ X_{0.5} &= 4 \cdot 0.5 + 1 \\ X_{0.5} &= 3. \end{align}$$
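We can verify this step, and the one after it, in a few lines of Python (a toy sketch; the constant velocity of 4 stands in for the trained network's prediction):

```python
x = 1.0          # starting position X_0
h = 0.5          # step size
velocity = 4.0   # the "perfect" network's constant prediction

for step in range(2):        # two steps take us from t = 0 to t = 1
    x = velocity * h + x     # Euler update: X_{t+h} = u_t(X_t) * h + X_t
    print(f"t = {(step + 1) * h}: X = {x}")
# t = 0.5: X = 3.0
# t = 1.0: X = 5.0
```

Note that this matches the straight-line path from $x_0 = 1$ to $x_1 = 5$, whose target velocity is exactly $x_1 - x_0 = 4$.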

A Dummy Python Code Example

A massive advantage of flow matching is that it is a simulation-free framework, meaning we do not actually need to solve the ODE during training. We simply sample a point along the path and compute the loss. Here is a dummy Python example of what a training step looks like:

def dummy_neural_net(x_position, time_t):
    # In reality, this would be a PyTorch model with weights.
    # For now, it just predicts a constant velocity of 2.0.
    return 2.0 
 
def flow_matching_train_step(x_0, x_1, t):
    """
    x_0: starting random noise (e.g., 1.0)
    x_1: target real data (e.g., 5.0)
    t: current time step (e.g., 0.5)
    """
    
    # 1. Calculate the current position along our straight-line path
    x_t = (1 - t) * x_0 + t * x_1
    
    # 2. Get the network's velocity prediction
    predicted_velocity = dummy_neural_net(x_t, t)
    
    # 3. Calculate the exact target velocity (the derivative)
    target_velocity = x_1 - x_0
    
    # 4. Compute the Mean Squared Error (MSE) Loss
    loss = (predicted_velocity - target_velocity) ** 2
    
    return loss
 
# Let's run it with our previous example numbers!
current_loss = flow_matching_train_step(x_0=1.0, x_1=5.0, t=0.5)
print(f"Calculated Loss: {current_loss}")

A few key takeaways from this code:

  1. The target_velocity is the ground truth we want the neural network to learn, which is simply x_1 - x_0, i.e., $\frac{d}{dt} X_t = x_1 - x_0$.
  2. The loss is calculated using MSE against the target velocity.
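To make the "training" part concrete, here is a toy sketch that replaces the neural network with a single learnable scalar v and fits it with hand-derived gradient descent (purely illustrative; a real model would be a neural network trained with an autodiff framework):

```python
# A toy "network": one learnable constant velocity v, a hypothetical
# stand-in for u_t^theta. It starts at 2.0, like dummy_neural_net above.
v = 2.0
lr = 0.1                 # learning rate
x_0, x_1 = 1.0, 5.0
target = x_1 - x_0       # ground-truth velocity, here 4.0

for step in range(100):
    loss = (v - target) ** 2
    grad = 2 * (v - target)   # d(loss)/dv, derived by hand
    v = v - lr * grad         # gradient-descent update

print(round(v, 4))  # converges to 4.0, the target velocity
```

Each step shrinks the gap between the prediction and x_1 - x_0, which is exactly what minimizing the MSE loss in the training step above does.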

Why do we need to learn the intermediate $X_t$?

Since we know the velocity $\frac{d}{dt} X_t$ is a constant in our straight-line example, what is the point of learning the "intermediate" position $X_t$ based on time $t$? Why not just learn a direct mapping from $X_0 \to X_1$?

Let's borrow Jacobi's mental model of "inverting" the problem. Imagine we ignore time $t$ and intermediate positions $X_t$, and simply train a network to jump straight from a random noise input $X_0$ to a final output $X_1$.

During training, the network memorizes:

  • Noise A $\rightarrow$ Crisp picture of a Cat
  • Noise B $\rightarrow$ Crisp picture of a Dog

Now, during inference, we sample a brand-new piece of noise, Noise C. This new noise happens to be exactly the 50/50 average of Noise A and Noise B in our starting distribution.

What would the output $X_1$ be?

The answer is that the network would output a blurry, average mix of a cat AND a dog! By introducing time tt and learning the vector field instead of a direct jump, the model learns a structured, continuous trajectory that smoothly flows toward a distinct, realistic target.

In short, that's the core distinction between supervised learning and generative models! It's also the distinction between a mixture of Gaussians and a flow model! Both are discussed in CS 185/285: Deep Reinforcement Learning, and I finally understand it!
