This week marks a fresh start. I have paused my previous project on 3D tetromino 6D pose estimation and shifted to an industry-funded project related to VLA models.
This is the first time I have stepped into the field of robot learning, and it is a mix of excitement and intimidation. I am excited to be involved in one of the most competitive research fields today; at the same time, I feel a bit “frightened” seeing how many talented researchers are investing their time here.
Taxonomy of Robot Learning
One lesson from the 3D Tetris project: I made the mistake of not reading a survey paper, such as 📄Deep Learning-Based Object Pose Estimation: A Comprehensive Survey, at the very beginning. Because of that, I lacked the complete picture.
I tried to avoid repeating this mistake in my new project. Since the field is changing so rapidly, there isn’t really a formal textbook that covers the big picture, so I turned to university courses instead. Among the courses titled “Robot Learning”, I found CMU 16-831: Introduction to Robot Learning to be the best. The taxonomy presented in the first lecture has become my guide, and questions I had puzzled over for a while finally seem to have answers.
Thanks to Prof. David Held. This table is gold 💰 and has saved me so much time.
| categories | | |
|---|---|---|
| action space | discrete | continuous |
| type of feedback | instructive / supervised | evaluative (reward) |
| type of interaction | one-shot | sequential |
^6ea5aa
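To make the first axis concrete, here is a minimal sketch of what discrete and continuous action spaces look like in Gymnasium. This is my own illustration; CartPole and Pendulum are just convenient stock environments, not examples from the lecture.

```python
import gymnasium as gym

# Discrete action space: the agent picks one of n actions.
cartpole = gym.make("CartPole-v1")
print(cartpole.action_space)           # Discrete(2): push cart left or right
print(cartpole.action_space.sample())  # e.g. 0 or 1

# Continuous action space: the agent outputs a real-valued vector.
pendulum = gym.make("Pendulum-v1")
print(pendulum.action_space)           # Box(-2.0, 2.0, (1,), float32): torque
print(pendulum.action_space.sample())  # e.g. [0.73]
```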
Bridging the Sim2Real Gap
This week, I quickly prototyped a simulation of the Galaxea R1 Pro using ManiSkill. Below is a comparison between the simulation and reality.
| | Reality | Simulation |
|---|---|---|
| Head | ||
| Left Hand | ||
| Right Hand |
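For context, here is a rough sketch of the kind of ManiSkill setup behind the prototype. It uses the stock PickCube task and Panda arm as stand-ins, since registering the Galaxea R1 Pro as a custom agent from its URDF doesn’t fit in a snippet; the observation layout reflects my understanding of ManiSkill 3 and may differ across versions.

```python
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (importing registers ManiSkill environments)

# Stock task + robot as stand-ins for the actual Galaxea R1 Pro setup.
env = gym.make(
    "PickCube-v1",
    robot_uids="panda",  # placeholder; the real prototype uses a custom agent
    obs_mode="rgbd",     # camera observations, for sim-vs-real comparisons
    render_mode="rgb_array",
)

obs, info = env.reset(seed=0)
# Camera images live under obs["sensor_data"]; these are the views
# to compare against the real robot's head and wrist cameras.
for name, data in obs["sensor_data"].items():
    print(name, data["rgb"].shape)
env.close()
```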
There are visible gaps between Sim and Real. The first is quite obvious: the wrist camera views in the simulation don’t quite align with reality. Although the URDF was copied directly from the NVIDIA Jetson AGX inside the Galaxea R1 Pro, I believe there are still some mismatches.
The second gap is the environment itself: table texture, room lighting, ambient conditions, and so on.
While these are observable visual gaps, the most worrying part is the dynamics: how can we ensure that the physics in the simulation aligns with reality? I think a fun research direction is “Real2Sim2Real”. In a nutshell, we collect real-world data to improve a differentiable simulator, thereby closing the gap.
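To make the idea concrete, here is a toy sketch, not any particular paper’s method: a hand-written differentiable point-mass simulator in PyTorch with an unknown friction coefficient, fitted to a “real” trajectory by gradient descent. Everything here (the dynamics, `rollout`, `sim_friction`) is invented for illustration.

```python
import torch

def rollout(friction, x0, v0, steps=50, dt=0.05):
    """Differentiable point-mass sim: velocity decays with friction."""
    xs, x, v = [], x0, v0
    for _ in range(steps):
        v = v - friction * v * dt  # friction decelerates the mass
        x = x + v * dt
        xs.append(x)
    return torch.stack(xs)

# "Real" data: a trajectory generated with the true (unknown) friction.
true_friction = torch.tensor(0.8)
real_traj = rollout(true_friction, torch.tensor(0.0), torch.tensor(1.0))

# Real2Sim: start from a wrong guess and fit it to the real trajectory.
sim_friction = torch.tensor(0.1, requires_grad=True)
opt = torch.optim.Adam([sim_friction], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    sim_traj = rollout(sim_friction, torch.tensor(0.0), torch.tensor(1.0))
    loss = torch.mean((sim_traj - real_traj) ** 2)
    loss.backward()  # gradients flow through the entire rollout
    opt.step()

print(f"recovered friction: {sim_friction.item():.3f}")  # ≈ 0.800
```

The Sim2Real half would then train a policy inside the calibrated simulator and deploy it back on the robot.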
A few papers worth recommending:
Similarity between Gym and Probabilistic Graphical Models
While studying reinforcement learning, I noticed a similarity between the Gymnasium API and probabilistic graphical models. I wrote down my reflections in Why the Gymnasium API Looks the Way It Does?.
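To spell out the correspondence as I see it (my own reading, not anything from the Gymnasium docs), each call in the standard loop can be viewed as sampling one node of the MDP’s graphical model:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

# reset() ~ sampling the initial-state node: s_0 ~ p(s_0)
obs, info = env.reset(seed=0)

terminated = truncated = False
while not (terminated or truncated):
    # The policy is the conditional a_t ~ pi(a_t | s_t);
    # a random policy here, just to drive the loop.
    action = env.action_space.sample()

    # step() ~ sampling the transition and reward nodes:
    # s_{t+1} ~ p(s_{t+1} | s_t, a_t),  r_t ~ p(r_t | s_t, a_t)
    obs, reward, terminated, truncated, info = env.step(action)

env.close()
```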