As part of my Homework 1: Imitation Learning for CS 185/285: Deep Reinforcement Learning and my research in robot learning, I need to quickly get up to speed on the basics of Flow Matching. This blog post documents my intuitive understanding of the topic.
Ordinary Differential Equations (ODEs)
What is an ODE?
To understand flow matching, we first need to review ordinary differential equations. In this context, an ODE defines a flow and looks like the following:

$$\frac{dx_t}{dt} = v(x_t, t)$$

where
- $t$ denotes time
- $x_t$ refers to the position of our data point at time $t$
- $v(x_t, t)$ is the vector field (or velocity field) that prescribes the instantaneous movement or velocity of $x_t$ at any given location and time.
The Euler Method
To compute where a starting point ends up over time, we can approximate the solution to the ODE using the Euler method. We step forward with a small time step $h$ and update our position based on the velocity that the vector field gives at our current position. The position at the next time step is calculated as:

$$x_{t+h} = x_t + h \, v(x_t, t)$$
Let's look at an example. Suppose we have:
- vector field: $v(x_t, t) = -x_t$
- starting position: $x_0 = 4.0$
- step size: $h = 0.5$
We want to find our final position at $t = 1.0$. Here is the state table for our simulation:
| Step | Time $t$ | Position $x_t$ | Velocity $v(x_t, t)$ | Next Position $x_{t+h}$ |
|---|---|---|---|---|
| 0 | 0.0 | 4.0 | -4.0 | 2.0 |
| 1 | 0.5 | 2.0 | -2.0 | 1.0 |
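The state table above can be reproduced with a few lines of plain Python. Here is a minimal sketch of the Euler method, using the example vector field $v(x_t, t) = -x_t$:

```python
def euler_simulate(v, x, t, h, t_end):
    """Step an ODE forward with the Euler method, printing each state."""
    while t < t_end:
        velocity = v(x, t)
        x_next = x + h * velocity
        print(f"t={t:.1f}  x={x:.1f}  v={velocity:.1f}  x_next={x_next:.1f}")
        x, t = x_next, t + h
    return x

# Vector field from the example: v(x, t) = -x
final_x = euler_simulate(lambda x, t: -x, x=4.0, t=0.0, h=0.5, t_end=1.0)
print(f"Final position at t=1.0: {final_x}")  # 1.0
```

Each printed row matches a row of the table, and the final position after two steps is 1.0.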
ODE with Neural Networks
Instead of explicitly hardcoding a simple vector field like $v(x_t, t) = -x_t$, we want a neural network to learn the vector field for us. We denote this learnable vector field as:

$$v_\theta(x_t, t)$$

The goal of this network with parameters $\theta$ is to learn a vector field that can seamlessly transform initial random noise (at time $t = 0$) into a complex data destination (such as an image) at time $t = 1$.
Let the starting noise be $x_0$ and the target real data be $x_1$. To keep things simple, let's define a point-to-point straight-line path from $x_0$ to $x_1$. The position along this path at any time $t \in [0, 1]$ will be:

$$x_t = (1 - t)\, x_0 + t\, x_1$$
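As a quick numeric sanity check of the interpolation (with hypothetical values $x_0 = 1$ and $x_1 = 5$), the path traces a straight line from noise to data:

```python
x_0, x_1 = 1.0, 5.0  # hypothetical noise and data values

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    x_t = (1 - t) * x_0 + t * x_1
    print(f"t={t:.2f}  x_t={x_t:.1f}")
# x_t moves linearly from 1.0 (pure noise) to 5.0 (pure data)
```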
Matching The Velocity
To train our network to recreate this flow, we need its output to match the velocity of this exact path. In calculus, the target velocity (the derivative $\frac{dx_t}{dt}$) for this specific straight-line path is:

$$\frac{dx_t}{dt} = \frac{d}{dt}\left[(1 - t)\, x_0 + t\, x_1\right] = x_1 - x_0$$
Remark
Just a quick math explanation: we treat the starting noise $x_0$ and the final real data $x_1$ as constants. Differentiating term by term, $\frac{d}{dt}\left[(1 - t)\, x_0\right] = -x_0$ and $\frac{d}{dt}\left[t\, x_1\right] = x_1$, leaving us with just $x_1 - x_0$ as the velocity.
Flow matching is simply the process of training a network $v_\theta$ to output this exact velocity so it can push initial noise $x_0$ toward the target data $x_1$. The movement of data resembles a flow, hence the name.
The Euler Method for Inference
Suppose we have successfully trained a "perfect" neural network that learned the vector field $v_\theta(x_t, t)$. We can use the Euler method to step through time and update our position with step size $h$ at each step:

$$x_{t+h} = x_t + h \, v_\theta(x_t, t)$$
For example, given our setup:
- Starting position: $x_0 = 1$
- Step size $h$: 0.5
- Predicted velocity $v_\theta(x_0, 0)$: 4
What is our new position, $x_{0.5}$, after taking this first step?
The answer is $x_{0.5} = x_0 + h \, v_\theta(x_0, 0) = 1 + 0.5 \times 4 = 3$.
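Chaining these Euler steps gives the full inference loop. Here is a minimal sketch, assuming a hypothetical "perfect" network that always outputs the constant target velocity $x_1 - x_0 = 5 - 1 = 4$:

```python
def perfect_net(x_t, t):
    # Stand-in for a trained network: for the straight-line path from
    # x_0 = 1 to x_1 = 5, the true velocity is constant: 5 - 1 = 4.
    return 4.0

x, t, h = 1.0, 0.0, 0.5
while t < 1.0:
    x = x + h * perfect_net(x, t)
    t += h
    print(f"t={t:.1f}  x={x:.1f}")
# After two steps the sample lands exactly on the target x_1 = 5.0
```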
A Dummy Python Code Example
A massive advantage of Flow Matching is that it is a simulation-free framework, meaning we never need to solve the ODE during training. We simply sample a point on the known interpolation path and compute a regression loss there. Here is a dummy Python example of what a training step looks like:
```python
def dummy_neural_net(x_position, time_t):
    # In reality, this would be a PyTorch model with weights.
    # For now, it just predicts a constant velocity of 2.0.
    return 2.0

def flow_matching_train_step(x_0, x_1, t):
    """
    x_0: starting random noise (e.g., 1.0)
    x_1: target real data (e.g., 5.0)
    t: current time step (e.g., 0.5)
    """
    # 1. Calculate the current position along our straight-line path
    x_t = (1 - t) * x_0 + t * x_1

    # 2. Get the network's velocity prediction
    predicted_velocity = dummy_neural_net(x_t, t)

    # 3. Calculate the exact target velocity (the derivative)
    target_velocity = x_1 - x_0

    # 4. Compute the Mean Squared Error (MSE) loss
    loss = (predicted_velocity - target_velocity) ** 2
    return loss

# Let's run it with our previous example numbers!
current_loss = flow_matching_train_step(x_0=1.0, x_1=5.0, t=0.5)
print(f"Calculated Loss: {current_loss}")
```

A few key takeaways from this code:
- The `target_velocity` is the ground truth we want the neural network to learn, which is simply `x_1 - x_0`, i.e., $x_1 - x_0 = 5 - 1 = 4$.
- The loss is calculated using MSE against the target velocity.
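To make the learning itself concrete, here is a hedged sketch of a full training loop. Instead of a real PyTorch model, the "network" is a single learnable scalar (its constant velocity guess) updated by hand-written gradient descent; all the values are illustrative:

```python
import random

x_0, x_1 = 1.0, 5.0  # noise and data (illustrative values)
theta = 0.0          # the "network": one learnable velocity value
lr = 0.1             # learning rate

for step in range(100):
    t = random.random()            # sample a random time in [0, 1]
    x_t = (1 - t) * x_0 + t * x_1  # point on the straight-line path
    target_velocity = x_1 - x_0    # ground-truth velocity (= 4.0)
    # This toy network ignores x_t and t; a real model would condition
    # on both. MSE loss is (theta - target)^2, so the gradient w.r.t.
    # theta is 2 * (theta - target).
    grad = 2 * (theta - target_velocity)
    theta -= lr * grad

print(f"Learned velocity: {theta:.4f}")  # converges toward 4.0
```

Note that no ODE is ever solved inside the loop; each step only samples a time, interpolates, and regresses the velocity.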
Why do we need to learn the intermediate $x_t$?
Since we know the velocity is a constant $x_1 - x_0$ in our straight-line example, what is the point of learning the "intermediate" position $x_t$ based on time $t$? Why not just learn a direct mapping from $x_0$ to $x_1$?
Let's use Jacobi's mental model of "inverting" the problem. Imagine we ignore time $t$ and intermediate positions $x_t$, and simply train a network to jump straight from a random noise input $x_0$ to a final output $x_1$.
During training, the network memorizes:
- Noise A → Crisp picture of a Cat
- Noise B → Crisp picture of a Dog
Now, during inference, we sample a brand new piece of noise, Noise C. This new noise happens to sit exactly halfway between Noise A and Noise B in our starting distribution.
What would the output be?
The answer is that the network would output a blurry, average mix of a cat AND a dog! By introducing time and learning the vector field instead of a direct jump, the model learns a structured, continuous trajectory that smoothly flows toward a distinct, realistic target.
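This averaging effect can be demonstrated with a toy least-squares fit. Here the "images" are hypothetical 2-dimensional vectors ([1, 0] for cat, [0, 1] for dog), the noise is a scalar, and the direct-jump network is the unique linear map through both memorized pairs:

```python
# Two memorized training pairs (hypothetical toy data):
noise_a, cat = -1.0, [1.0, 0.0]
noise_b, dog = +1.0, [0.0, 1.0]

def direct_jump(z):
    # The unique linear map fitting both training pairs: interpolate
    # between the cat and dog outputs based on where z sits.
    w = (z - noise_a) / (noise_b - noise_a)  # 0 at noise_a, 1 at noise_b
    return [(1 - w) * c + w * d for c, d in zip(cat, dog)]

noise_c = 0.0  # a new noise sample, exactly between A and B
print(direct_jump(noise_c))  # [0.5, 0.5] -- a half-cat, half-dog blur
```

The midpoint noise maps to an equal blend of both training outputs, which is exactly the blurry average described above.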
In short, that's the core distinction between supervised learning and generative models! That's also the distinction between a mixture of Gaussians and a flow model! They are all discussed in CS 185/285: Deep Reinforcement Learning, and I finally understand it!