Logo Xingxin on Bug

A Visual Guide to Hand-Eye Camera Calibration in Robotics

May 16, 2026
9 min read

Recently, I have been wrestling with camera calibration. During the process, I found myself pondering a few questions.

  • what exactly is hand-eye calibration?
  • what specific transformation are we trying to uncover?
  • how do ArUco and ChArUco boards help in this process?
  • does the position of the ArUco and ChArUco board relative to the end effector matter?

In this post, I’ll answer all these questions intuitively.

Remark

Btw, I built a self-contained repo to make Franka + RealSense camera calibration painless. No Docker required. No system ROS2 install needed. Just one command to setup.🎉

See franka_moveit_camera_calibration.

Notation

Let’s establish our notation first.

Remark

This note follows the notation convention from the Introduction to Robotics: Mechanics and Control and Russ Tedrake.

The term

AXB{}^A X^B

means “the pose of frame {B}\Set{B} measured in frame {A}\Set{A}”.

Example

Suppose the position of {B}\Set{B} measured from {A}\Set{A} is 1,0.5,0.51, 0.5, 0.5, then AXB{}^A X^B can be written as AXB=[0.01.00.01.01.00.00.00.50.00.01.00.50.00.00.01.0]{}^A X^B =\begin{bmatrix}0.0 & -1.0 & 0.0 & 1.0 \\ 1.0 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.5 \\ 0.0 & 0.0 & 0.0 & 1.0 \end{bmatrix}

Composition follows the frame labels mathematically:

AXBBXC=AXC.{}^AX^B {}^BX^C ={}^AX^C.
Tip

Hint: we normally read the transformation from right to left.

  • CAC \to A
  • CB,BAC \to B, B\to A
Remark

You can think of XX as a homogeneous transformation matrix or a rigid motion.

Frames

For camera calibration using a robot, we define the following frames.

  • BB: Robot Base frame
  • EE: Robot Effector frame
  • CC: Camera optical frame
  • TT: ArUco/ChArUco target frame

camera-calibration-frames-in-robotics.webp

Remark

I’ll use eye-to-hand as the primary example, but the core math also works for eye-in-hand. I’ll leave applying this logic to eye-in-hand as an exercise for you. 😆

Known Transformation

During calibration, we record multiple samples. Let’s use ii to denote an arbitrary sample.


The transformation

BXEi^B X^{E_i}

is known. It represents the pose of end-effector measured from the robot’s base. We can easily obtain this via robot’s forward kinematics.


The transformation CXTi^C X^{T_i} is also known. It represents the ChArUco/ArUco target frame measured from the camera’s optical frame.

How does the CXTi^C X^{T_i} output the same unit (e.g., meter) as BXEi^B X^{E_i}?

Before calibrating, we have to fill out some configuration values. moveit-calibration-target-params.webp Notice the parameters “longest board size” and “measured marker size”. These refer to the physical dimensions you measure in reality. The camera detects the marker in pixel space, but using the camera’s intrinsics and these known physical dimensions, the algorithm sclaes the detection into physical units that match the robot’s coordinate system.

Does precision matter?

Absolutely! You can measure the board with a ruler. But for high-precision calibration, exact tolerances are crucial. This is why professional calibration boards (like this one) can sell up to $143!

Unknown Transformation

The transformation

BXC^B X^C

is unknown. It refers to the camera frame measured from the robot base frame.Finding this is the entire goal of eye-to-hand calibration.


The transformation

EXT^E X^T

is unknown and it refers to the ArUco/ChAruCo target frame measured from the end-effector frame.

Tip

The good news is this transformation is fixed as long as the gripper firmly grasps this board.

Example

Here’s an example how ChAruco can be detected and referred as target frame {T}\Set{T}. eye-to-hand-detected-charuco.webp

How Do We Solve it?

Thinking mathematically, we have 2 unknowns and 2 knowns. To solve this, our intuition tell us we need to construct a system of equations.

equivalent-transformations.webp

Looking at the physical setup, we define the target pose BXT{}^B X^T through two different paths:

  • Path 1 (Through the arm): BETB \to E \to T which is BXEEXT^B X^{E} \cdot {}^E X^T
  • Path 2 (Through the camera): BCTB \to C \to T which is BXCCXT^B X^C \cdot {}^C X^{T}

Recall the composition rule we mention above that two homogeneous transformation matrices can be written as

AXBBXC=AXC.{}^AX^B {}^BX^C ={}^AX^C.

Since both paths end at the exact same physical target in space, we can equate them:

BXEiEXT=BXCCXTi,^B X^{E_i} \cdot {}^E X^T = {}^B X^C \cdot {}^C X^{T_i},

where both sides are BXT^B X^T.


Look closely at this equivalent transformation:

BXEiEXT=BXCCXTi.^B X^{E_i} \cdot {}^E X^T = {}^B X^C \cdot {}^C X^{T_i}.

It contains 2 unknowns (EXT^E X^T and BXC^B X^C). However, BXC^B X^C (the camera pose) is the only one we actually care about. The ChArUco/ArUco board offset EXT^E X^T in the gripper is just a fixed “nuisance parameter” blocking our way.

Therefore, our mathematical strategy is to cancel out EXT^E X^T completely so we are only left with the camera pose.

To do this, we need a second equation. If we take two distinct samples, we can use one to isolate the nuisance parameter and substitute it into the other, effectively erasing EXT^E X^T from our math entirely.

Let’s see how this variable elimination works. Suppose we take two different poses, ii and jj, giving us two equations:

BXEiEXT=BXCCXTi,BXEjEXT=BXCCXTj.\begin{align} ^B X^{E_i} \cdot {}^E X^T &= {}^B X^C \cdot {}^C X^{T_i} \tag{sample 1},\\ ^B X^{E_j} \cdot {}^E X^T &= {}^B X^C \cdot {}^C X^{T_j} \tag{sample 2}. \end{align}

We can isolate EXT^E X^T in the first equation by multiplying both sides by the inverse of the end-effector pose:

EXT=(BXEi)1BXCCXTi.^E X^T = (^{B} X^{E_i})^{-1} \cdot {}^B X^C \cdot {}^C X^{T_i}.

Next, we substitute this isolated EXT^E X^T into the second equation for sample jj:

BXEj[(BXEi)1BXCCXTi]=BXCCXTj.^B X^{E_j} \cdot \left[ (^{B} X^{E_i})^{-1} \cdot {}^B X^C \cdot {}^C X^{T_i} \right] = {}^B X^C \cdot {}^C X^{T_j}.

Now, we rearrange the terms to isolate the relative motions. Multiply the right side by (CXTi)1(^C X^{T_i})^{-1}:

[BXEj(BXEi)1BXCCXTi](CXTi)1=[BXCCXTj](CXTi)1BXEj(BXEi)1BXCCXTi(CXTi)1=BXCCXTj(CXTi)1BXEj(BXEi)1BXC=BXCCXTj(CXTi)1\begin{align} [^B X^{E_j} \cdot (^{B} X^{E_i})^{-1} \cdot {}^B X^C \cdot {}^C X^{T_i}] \cdot \mathbf{(^C X^{T_i})^{-1}} &= [^B X^C \cdot {}^C X^{T_j}] \cdot \mathbf{(^C X^{T_i})^{-1}} \\ ^B X^{E_j} \cdot (^{B} X^{E_i})^{-1} \cdot {}^B X^C \cancel{{}^C X^{T_i} \cdot (^C X^{T_i})^{-1}} &= {}^B X^C \cdot {}^C X^{T_j} \cdot (^{C} X^{T_i})^{-1}\\ ^B X^{E_j} \cdot (^{B} X^{E_i})^{-1} \cdot {}^B X^C &= {}^B X^C \cdot {}^C X^{T_j} \cdot (^{C} X^{T_i})^{-1} \end{align}

Let’s take a careful look on the last equation:

BXEj(BXEi)1BXC=BXCCXTj(CXTi)1.\underbrace {^B X^{E_j} \cdot (^{B} X^{E_i})^{-1}} \cdot {}^B X^C = {}^B X^C \cdot \underbrace{{}^C X^{T_j} \cdot (^{C} X^{T_i})^{-1}}.

Notice that both bracketed terms describe a relative motion between the 2 samples:

  • Let A=BXEj(BXEi)1A = {}^B X^{E_j} \cdot (^{B} X^{E_i})^{-1} . This is the relative motion measured by the robot.
  • Let B=CXTj(CXTi)1B = {}^C X^{T_j} \cdot (^{C} X^{T_i})^{-1} . This is the relative motion measured by the camera.
  • Let X=BXCX = {}^B X^C . This is the static unknown camera pose we want to find.

We have successfully derived the most famous equation in hand-eye calibration:

AX=XB.A X = X B.


Remark

I’ll unveil the conceptual solution here, but to deeply understand this, I recommend two materials:

Let’s clarify the terminology:

  • A Sample (or Pose): This is a single, static snapshot. The robot stops moving, and the camera takes a single picture of the marker. Normally you would click the “Take Sample” button.
  • A Relative Motion: This is the physical movement between 2 samples. For example, the BXEj(BXEi)1{}^B X^{E_j} \cdot (^{B} X^{E_i})^{-1} is a relative motion between 2 samples.
  • An Equation: In the AX=XBAX = XB format, one equation represents exactly one relative motion.

To solve XX, we at least need

  • 3 distinct samples
  • 2 relative motions
Remark

When taking these samples, the motions must occur around different physical axes. Curious why? The screw theory explains everything!

Now let’s use i,j,ki, j, k to denote 3 samples. We then have 2 relative motions:

A1X=XB1A2X=XB2\begin{align} A_1 X &= X B_1 \tag{motion 1} \\ A_2 X &= X B_2 \tag{motion 2} \end{align}

where the known relative motions are defined as:

  • Motion 1: A1=BXEj(BXEi)1A_1 = {}^B X^{E_j} \cdot (^{B} X^{E_i})^{-1} and B1=CXTj(CXTi)1B_1 = {}^C X^{T_j} \cdot (^{C} X^{T_i})^{-1}
  • Motion 2: A2=BXEk(BXEj)1A_2 = {}^B X^{E_k} \cdot (^{B} X^{E_j})^{-1} and B2=CXTk(CXTj)1B_2 = {}^C X^{T_k} \cdot (^{C} X^{T_j})^{-1}
If the math only strictly requires 3 samples, why do MoveIt and ROS tutorials usually ask you to collect 15 to 20?

In a mathematically perfect simulation, A1X=XB1A_1 X = X B_1 and A2X=XB2A_2 X = X B_2 are fully sufficient to lock down the 6 degrees of freedom. However, in physical reality, our knowns are flawed. The robot joints have tiny physical deflections (making AA noisy), and the camera has limited resolution and lighting artifacts (making BB noisy).

To combat this noise, we generalize this into an over-determined system of NN equations:

A1X=XB1A2X=XB2ANX=XBN\begin{align} A_1 X &= X B_1 \\ A_2 X &= X B_2 \\ &\vdots \\ A_N X &= X B_N \end{align}

By collecting many samples across diverse angles and rotations throughout the robot’s workspace, we feed all these AmX=XBmA_m X = X B_m equations into an optimization algorithm (such as the Tsai-Lenz method).

In short, instead of looking for a “perfect” XX, we minimize the error to pursue an optimal XX.

See also...