This week marks a fresh start. I have paused my previous project on 3D tetromino 6D pose estimation and shifted to an industry-funded project related to VLA models.
This is the first time I have stepped into the field of robot learning, and it is a mix of excitement and intimidation. I am excited to be involved in one of the most competitive research fields today; at the same time, I feel a bit “frightened” seeing how many talented researchers are investing their time here.
Taxonomy of Robot Learning
One lesson from the 3D Tetris project: I made the mistake of not reading a survey paper, such as 📄Deep Learning-Based Object Pose Estimation: A Comprehensive Survey, at the very beginning. Because of that, I lacked the complete picture.
I tried to avoid repeating this mistake in my new project. Since the field is changing so rapidly, there isn’t really a formal textbook that covers the big picture, so I turned to university courses instead. Among the courses titled “Robot Learning”, I found CMU 16-831: Introduction to Robot Learning to be the best. The taxonomy presented in the first lecture has become my guide, and questions I had puzzled over for a while finally seem to have answers.
Thanks to Prof. David Held. This table is gold 💰 and has saved me so much time.
| categories | | |
|---|---|---|
| action space | discrete | continuous |
| type of feedback | instructive / supervised | evaluative (reward) |
| type of interaction | one-shot | sequential |
^6ea5aa
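To make the first axis concrete, here is a minimal sketch of what discrete and continuous action spaces look like in Gymnasium. This is my own illustration; CartPole and Pendulum are just convenient stock environments, not examples from the lecture.

```python
import gymnasium as gym

# Discrete action space: the agent picks one of n actions.
cartpole = gym.make("CartPole-v1")
print(cartpole.action_space)           # Discrete(2): push cart left or right
print(cartpole.action_space.sample())  # e.g. 0 or 1

# Continuous action space: the agent outputs a real-valued vector.
pendulum = gym.make("Pendulum-v1")
print(pendulum.action_space)           # Box(-2.0, 2.0, (1,), float32): torque
print(pendulum.action_space.sample())  # e.g. [0.73]
```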
Bridging the Sim2Real Gap
This week, I quickly prototyped a simulation of the Galaxea R1 Pro using ManiSkill. Below is a comparison between the simulation and reality.
| | Reality | Simulation |
|---|---|---|
| Head | ||
| Left Hand | ||
| Right Hand |
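For context, here is a rough sketch of the kind of ManiSkill setup behind the prototype. It uses the stock PickCube task and Panda arm as stand-ins, since registering the Galaxea R1 Pro as a custom agent from its URDF doesn’t fit in a snippet; the observation layout reflects my understanding of ManiSkill 3 and may differ across versions.

```python
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (importing registers ManiSkill environments)

# Stock task + robot as stand-ins for the actual Galaxea R1 Pro setup.
env = gym.make(
    "PickCube-v1",
    robot_uids="panda",  # placeholder; the real prototype uses a custom agent
    obs_mode="rgbd",     # camera observations, for sim-vs-real comparisons
    render_mode="rgb_array",
)

obs, info = env.reset(seed=0)
# Camera images live under obs["sensor_data"]; these are the views
# to compare against the real robot's head and wrist cameras.
for name, data in obs["sensor_data"].items():
    print(name, data["rgb"].shape)
env.close()
```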
There are visible gaps between Sim and Real. The first is quite obvious: the wrist camera views in the simulation don’t quite align with reality. Although the URDF was copied directly from the NVIDIA Jetson AGX inside the Galaxea R1 Pro, I believe there are still some mismatches.
The second gap is the environment itself: table texture, room lighting, ambient conditions, and so on.
While these are observable visual gaps, the most worrying part is the dynamics: how can we ensure that the physics in the simulation aligns with reality? I think a fun research direction is “Real2Sim2Real”. In a nutshell, we collect real-world data to improve a differentiable simulator, thereby closing the gap.
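To make the idea concrete, here is a toy sketch, not any particular paper’s method: a hand-written differentiable point-mass simulator in PyTorch with an unknown friction coefficient, fitted to a “real” trajectory by gradient descent. Everything here (the dynamics, `rollout`, `sim_friction`) is invented for illustration.

```python
import torch

def rollout(friction, x0, v0, steps=50, dt=0.05):
    """Differentiable point-mass sim: velocity decays with friction."""
    xs, x, v = [], x0, v0
    for _ in range(steps):
        v = v - friction * v * dt  # friction decelerates the mass
        x = x + v * dt
        xs.append(x)
    return torch.stack(xs)

# "Real" data: a trajectory generated with the true (unknown) friction.
true_friction = torch.tensor(0.8)
real_traj = rollout(true_friction, torch.tensor(0.0), torch.tensor(1.0))

# Real2Sim: start from a wrong guess and fit it to the real trajectory.
sim_friction = torch.tensor(0.1, requires_grad=True)
opt = torch.optim.Adam([sim_friction], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    sim_traj = rollout(sim_friction, torch.tensor(0.0), torch.tensor(1.0))
    loss = torch.mean((sim_traj - real_traj) ** 2)
    loss.backward()  # gradients flow through the entire rollout
    opt.step()

print(f"recovered friction: {sim_friction.item():.3f}")  # ≈ 0.800
```

The Sim2Real half would then train a policy inside the calibrated simulator and deploy it back on the robot.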
A few papers worth recommending:
Similarity between Gym and Probabilistic Graphical Models
While studying reinforcement learning, I noticed a similarity between the Gymnasium API and probabilistic graphical models. I wrote down my reflections in Why the Gymnasium API Looks the Way It Does?.
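To spell out the correspondence as I see it (my own reading, not anything from the Gymnasium docs), each call in the standard loop can be viewed as sampling one node of the MDP’s graphical model:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

# reset() ~ sampling the initial-state node: s_0 ~ p(s_0)
obs, info = env.reset(seed=0)

terminated = truncated = False
while not (terminated or truncated):
    # The policy is the conditional a_t ~ pi(a_t | s_t);
    # a random policy here, just to drive the loop.
    action = env.action_space.sample()

    # step() ~ sampling the transition and reward nodes:
    # s_{t+1} ~ p(s_{t+1} | s_t, a_t),  r_t ~ p(r_t | s_t, a_t)
    obs, reward, terminated, truncated, info = env.step(action)

env.close()
```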