
A Beginner's Guide to Flow Matching

February 21, 2026
7 min read

As part of my Homework 1: Imitation Learning for CS 185/285: Deep Reinforcement Learning and my research in robot learning, I need to quickly get up to speed on the basics of Flow Matching. This blog post documents my intuitive understanding of the topic.

Ordinary Differential Equations (ODEs)

📌 What is an ODE?

To understand flow matching, we first need to review ordinary differential equations. In this context, an ODE defines a flow and looks like the following:

$$\frac{dX_t}{dt} = u_t(X_t)$$

where

  • $t$ denotes time
  • $X_t$ refers to the point $X$ of our data at time $t$
  • $u_t(X_t)$ is the vector field (or velocity field) that prescribes the instantaneous movement, or velocity, of $X_t$ at any given location and time.

📌 The Euler Method

To compute where a starting point ends up over time, we can approximate the solution to the ODE using the Euler method. We step forward with a small time step $h$ and update our position based on the velocity the vector field prescribes at the current position. The position at the next time step $t+h$ is calculated as:

$$X_{t+h} = u_t(X_t) \cdot h + X_t.$$

Let's look at an example. Suppose we have:

  • vector field: $u_t(x) = -x$
  • starting position: $X_0 = 4$
  • step size: $h = 0.5$

We want to find our final position at $t = 1.0$. Here is the state table for our simulation:

| Step | Time $t$ | Position $X_t$ | Velocity $u_t(X_t)$ | Next Position $X_{t+h}$ |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 4.0 | -4.0 | 2.0 |
| 1 | 0.5 | 2.0 | -2.0 | 1.0 |
ODE with Neural Networks

Instead of explicitly hardcoding a simple vector field like $u_t(x) = -x$, we want a neural network to learn the vector field for us. We denote this learnable vector field as:

$$u_t^{\theta}(X_t).$$

The goal of this network with parameters $\theta$ is to learn a vector field that can seamlessly transform initial random noise $X_0$ into a complex data destination $X_1$ (such as an image) at time $t = 1$.

Let the starting noise be $x_0$ and the target real data be $x_1$. To keep things simple, let's define a point-to-point straight-line path from $x_0$ to $x_1$. The position along this path at any time $t$ will be:

$$X_t = (1-t)x_0 + t x_1.$$
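As a quick sanity check, this interpolation is one line of Python (a toy sketch with made-up numbers):

```python
def interpolate(x0, x1, t):
    """Position X_t on the straight-line path from x0 to x1 at time t in [0, 1]."""
    return (1 - t) * x0 + t * x1

# At t = 0 we sit at the noise, at t = 1 at the data,
# and at t = 0.5 exactly halfway in between.
print(interpolate(1.0, 5.0, 0.0))  # 1.0
print(interpolate(1.0, 5.0, 0.5))  # 3.0
print(interpolate(1.0, 5.0, 1.0))  # 5.0
```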

Matching The Velocity

To train our network to recreate this flow, we need its output to match the velocity of this exact path. The target velocity (the derivative $\frac{d}{dt}X_t$) for this specific straight-line path is:

$$\begin{align} X_t &= (1-t)x_0 + t x_1 \\ \frac{d}{dt} X_t &= \frac{d}{dt}\left((1-t)x_0 + t x_1\right) \\ \frac{d}{dt} X_t &= x_1 - x_0 \end{align}$$

Remark

Just a quick math explanation: we treat the starting noise $x_0$ and the final real data $x_1$ as constants. The derivative of a constant is simply $0$, i.e., $\frac{d}{dx}(c) = 0$, leaving us with just $x_1 - x_0$ as the velocity.

Flow matching is simply the process of training a network $u_t^{\theta}(X_t)$ to output this exact velocity so it can push initial data $X_0$ toward the target $X_1$. The movement of the data $X_t$ resembles a flow, hence the name.

The Euler Method for Inference

Suppose we have successfully trained a "perfect" neural network that learned a constant vector field $u_t^{\theta}(X_t) = 4$. We can use the Euler method to step through time and update our position once every time step $h$.

For example, given our setup:

  • Starting position: $X_0 = 1$
  • Step size $h$: 0.5
  • Predicted velocity $u_t(X_t)$: 4

What is our new position, $X_{0.5}$, after taking this first step?

The answer is

$$\begin{align} X_{t+h} &= u_t(X_t) \cdot h + X_t \\ X_{0.5} &= 4 \cdot 0.5 + X_0 \\ X_{0.5} &= 4 \cdot 0.5 + 1 \\ X_{0.5} &= 3. \end{align}$$
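We can verify this step, and the one after it, in a few lines of Python (a toy sketch; the constant velocity of 4 stands in for the trained network's prediction):

```python
x = 1.0          # starting position X_0
h = 0.5          # step size
velocity = 4.0   # the "perfect" network's constant prediction

for step in range(2):        # two steps take us from t = 0 to t = 1
    x = velocity * h + x     # Euler update: X_{t+h} = u_t(X_t) * h + X_t
    print(f"t = {(step + 1) * h}: X = {x}")
# t = 0.5: X = 3.0
# t = 1.0: X = 5.0
```

Note that this matches the straight-line path from $x_0 = 1$ to $x_1 = 5$, whose target velocity is exactly $x_1 - x_0 = 4$.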

A Dummy Python Code Example

A massive advantage of flow matching is that it is a simulation-free framework, meaning we do not actually need to solve the ODE during training. We simply sample a point along the path and compute the loss. Here is a dummy Python example of what a training step looks like:

def dummy_neural_net(x_position, time_t):
    # In reality, this would be a PyTorch model with weights.
    # For now, it just predicts a constant velocity of 2.0.
    return 2.0 
 
def flow_matching_train_step(x_0, x_1, t):
    """
    x_0: starting random noise (e.g., 1.0)
    x_1: target real data (e.g., 5.0)
    t: current time step (e.g., 0.5)
    """
    
    # 1. Calculate the current position along our straight-line path
    x_t = (1 - t) * x_0 + t * x_1
    
    # 2. Get the network's velocity prediction
    predicted_velocity = dummy_neural_net(x_t, t)
    
    # 3. Calculate the exact target velocity (the derivative)
    target_velocity = x_1 - x_0
    
    # 4. Compute the Mean Squared Error (MSE) Loss
    loss = (predicted_velocity - target_velocity) ** 2
    
    return loss
 
# Let's run it with our previous example numbers!
current_loss = flow_matching_train_step(x_0=1.0, x_1=5.0, t=0.5)
print(f"Calculated Loss: {current_loss}")

A few key takeaways from this code:

  1. The target_velocity is the ground truth we want the neural network to learn, which is simply x_1 - x_0, i.e., $\frac{d}{dt} X_t = x_1 - x_0$.
  2. The loss is calculated using MSE against the target velocity.
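To make the "training" part concrete, here is a toy sketch that replaces the neural network with a single learnable scalar v and fits it with hand-derived gradient descent (purely illustrative; a real model would be a neural network trained with an autodiff framework):

```python
# A toy "network": one learnable constant velocity v, a hypothetical
# stand-in for u_t^theta. It starts at 2.0, like dummy_neural_net above.
v = 2.0
lr = 0.1                 # learning rate
x_0, x_1 = 1.0, 5.0
target = x_1 - x_0       # ground-truth velocity, here 4.0

for step in range(100):
    loss = (v - target) ** 2
    grad = 2 * (v - target)   # d(loss)/dv, derived by hand
    v = v - lr * grad         # gradient-descent update

print(round(v, 4))  # converges to 4.0, the target velocity
```

Each step shrinks the gap between the prediction and x_1 - x_0, which is exactly what minimizing the MSE loss in the training step above does.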

Why do we need to learn the intermediate $X_t$?

Since we know the velocity $\frac{d}{dt} X_t$ is a constant in our straight-line example, what is the point of learning the "intermediate" position $X_t$ based on time $t$? Why not just learn a direct mapping from $X_0 \to X_1$?

Let's borrow Jacobi's mental model of "inverting" the problem. Imagine we ignore time $t$ and intermediate positions $X_t$, and simply train a network to jump straight from a random noise input $X_0$ to a final output $X_1$.

During training, the network memorizes:

  • Noise A $\rightarrow$ Crisp picture of a Cat
  • Noise B $\rightarrow$ Crisp picture of a Dog

Now, during inference, we sample a brand-new piece of noise, Noise C. This new noise happens to be exactly the 50/50 average of Noise A and Noise B in our starting distribution.

What would the output $X_1$ be?

The answer is that the network would output a blurry, average mix of a cat AND a dog! By introducing time tt and learning the vector field instead of a direct jump, the model learns a structured, continuous trajectory that smoothly flows toward a distinct, realistic target.

In short, that's the core distinction between supervised learning and generative models! It's also the distinction between a mixture of Gaussians and a flow model! Both are discussed in CS 185/285: Deep Reinforcement Learning, and I finally understand it!
