
How to Version Large Git Datasets with DVC and R2?

October 28, 2025
5 min read

In this article, I will walk you through versioning a large dataset (e.g., 10GB+) in your Git repo without ever running git add on those large files. The tech stack we are using is Cloudflare R2 and DVC (Data Version Control). As of October 28th, 2025, Cloudflare R2’s free tier is generous enough to store 10GB for free, making this a powerful alternative to Git LFS.

Remark

Cloudflare R2 is an object storage service that is compatible with the Amazon S3 API. This is what allows DVC to work with it seamlessly.

Tip

This article is a no-brainer, step-by-step tutorial.

TL;DR

The idea is to commit a lightweight pointer file (a hash) to Git, while your actual large data files are pushed to an S3-compatible bucket (Cloudflare R2).
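In command form, the core workflow from Step 5 boils down to the following preview (the dataset/ folder and commands are exactly the ones used later in this guide):

dvc add dataset/                # hash the data and write a small dataset.dvc pointer
git add dataset.dvc .gitignore  # version only the pointer in Git
git commit -m "Add large dataset"
dvc push                        # upload the actual files to the R2 bucket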


Step 1: Register a Cloudflare Account and R2 Bucket

  • Register: Navigate to this link to register a Cloudflare account if you haven’t.
  • Create Bucket: Go to the R2 page on your Cloudflare dashboard. Click “Create bucket,” give it a unique name, and choose a location.

Remark

A bucket is like a drive on your PC (e.g., D:). You can create folders inside it to organize your data.

Step 2: Create an R2 API Token

Why do we need an API token?

You need to give DVC permission to read and write to your bucket.

Follow these steps:

  • From the R2 Overview page, click “Manage R2 API Tokens” in the right sidebar (look for a button labeled {} Manage).
  • Click “Create API Token”.
  • Give the token a name (e.g., “dvc-worker”).
  • For permissions, select “Read & Write” so DVC can push and pull data.
  • You can apply this token to all buckets or a specific one. For this guide, selecting your new bucket is a good choice.
  • Click “Create API Token” and immediately copy the Access Key ID and Secret Access Key. You will not see the secret key again!

r2_create_user_token.webp

Step 3: Install DVC

Now, in your local project’s terminal, install Data Version Control. We include the [s3] extra, which gives DVC the libraries needed to talk to S3-compatible services.

# assume you have a Python environment with pip
pip install "dvc[s3]"  # quotes prevent the shell from globbing the brackets (e.g., in zsh)
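To double-check that the S3 extra was picked up, you can inspect the version output; on recent DVC releases it lists the supported remote types (the exact formatting varies by version):

dvc version
# look for "s3" in the "Supports" section of the output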

Step 4: Configure DVC in Your Git Repo

This is where we connect your local project to the R2 bucket.


📌Initialize DVC

In the root of your Git repository, run:

dvc init

This creates a .dvc/ directory (plus a .dvcignore file), which you should git add and commit.
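dvc init usually stages these files for you already, so committing them can be as simple as:

git add .dvc .dvcignore
git commit -m "Initialize DVC"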


📌Find Your R2 Endpoint

Go to the R2 bucket overview page and click the bucket you just created.

r2_bucket.webp

Copy the “S3 API” endpoint URL.

r2_s3_api.webp

It will look like this: https://<ACCOUNT_ID>.r2.cloudflarestorage.com


📌Add the DVC Remote

Run these commands in your terminal, replacing the placeholders with your own values. Let’s call our remote my-r2-remote.

# Set the bucket as the remote
dvc remote add -d my-r2-remote s3://<your_bucket_name>
 
# Set the R2 endpoint URL
dvc remote modify my-r2-remote endpointurl https://<ACCOUNT_ID>.r2.cloudflarestorage.com
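After these two commands, .dvc/config should contain something like the following (the exact layout can vary slightly between DVC versions):

[core]
    remote = my-r2-remote
['remote "my-r2-remote"']
    url = s3://<your_bucket_name>
    endpointurl = https://<ACCOUNT_ID>.r2.cloudflarestorage.com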

📌Set Your Credentials

Now, tell DVC the API keys you saved in Step 2. These are stored in your local DVC config, not checked into Git.

dvc remote modify --local my-r2-remote access_key_id 'YOUR_ACCESS_KEY_ID'
dvc remote modify --local my-r2-remote secret_access_key 'YOUR_SECRET_ACCESS_KEY'
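The --local flag writes the keys to .dvc/config.local, which DVC’s own .dvc/.gitignore keeps out of Git. It should look roughly like this:

['remote "my-r2-remote"']
    access_key_id = YOUR_ACCESS_KEY_ID
    secret_access_key = YOUR_SECRET_ACCESS_KEY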

Remark

If you have MFA enabled, you might need to add a session_token as well, as sketched below. See more at Amazon S3 | Data Version Control · DVC
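Assuming you are working with temporary credentials, setting it would look like this (session_token is the option name from DVC’s S3 remote docs; replace the placeholder with your own value):

dvc remote modify --local my-r2-remote session_token 'YOUR_SESSION_TOKEN'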

Step 5: Track and Push Your Data

This is the main workflow.

📌Add Your Data to DVC

Tell DVC to track your large data file or folder. Suppose I have a dataset/ folder that holds a few gigabytes of files.

dvc add dataset/

DVC creates a small file named dataset.dvc and automatically adds your original folder to .gitignore.

Here’s a quick look at the dataset.dvc file:

outs:
- md5: ea9d95d6be012b1962ca54a6b2861270.dir
  size: 3487406232
  nfiles: 23
  hash: md5
  path: dataset

📌Commit the Pointer File to Git

This is the key step. You are committing the small pointer file, not the large dataset.

git add dataset.dvc .gitignore
git commit -m "Add large dataset"

📌Push Your Data to R2

Now, push the actual data to your cloud storage.

dvc push
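Depending on your connection, the first push can take a while. Once it finishes, you can ask DVC to compare your local cache against the remote (the exact output wording differs across DVC versions):

dvc status -c
# expect a message saying the cache and remote 'my-r2-remote' are in sync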

Step 6: Verify Your Push

Congrats! Your data is now versioned. To test that it works, let’s mimic a fresh clone.

📌Simulate a fresh state

rm -rf .dvc/cache
rm -rf dataset/

📌Pull the data from R2

dvc pull

There you go! DVC reads the pointer file and downloads the large dataset from your R2 bucket. Your repo is compact, your data is versioned, and it can be pulled by anyone with access.
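For a teammate (or a future you on another machine), getting the data back is just a clone plus their own credentials. A rough sketch, with <your_repo_url> and <repo_name> as placeholders:

git clone <your_repo_url> && cd <repo_name>
pip install "dvc[s3]"
dvc remote modify --local my-r2-remote access_key_id 'THEIR_ACCESS_KEY_ID'
dvc remote modify --local my-r2-remote secret_access_key 'THEIR_SECRET_ACCESS_KEY'
dvc pull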