The art of robot reinforcement learning
Lessons learned by my third year of research
Hello world, I’m Elle. I am a third year PhD student focused on contact-rich manipulation via reinforcement learning (RL). I’ve essentially spent the last few years trying to persuade dexterous hands, robotic arms, and humanoids to do what I want.
Recently, I was invited to give a talk at the BeNeRL seminar series. They specifically asked for a “practical advice” section, and while preparing it, I realised that the reality of RL for robotics is far from the elegant math you find in textbooks. It’s a constant battle against strange simulator physics, grumpy MDPs, and brittle hyperparameters.
I decided to expand that talk into this post. These are my most important lessons - the things I wish I’d known before I started. If you have thoughts on any of these points, or your own stories, I would love to hear them in the comments :)
Your agent does not live on Earth
You wouldn’t fly by thermometer: Master the MDP
RL: Don’t start from scratch, but make it yours
Hyperparameters are your friend
WandB: Mission Control
Play the long game
1. Your agent does not live on Earth
After weeks of effort, you finally have a beautiful looking scene in Isaac Lab. Excited, you launch train.py right away and begin working on your shiny new methodology. But you’ve actually skipped over something more important: the physics of the scene. Do you know:
What is the restitution of your object? Is it like a brick or a bouncy ball?
What is the dynamic friction of your robot’s left finger? Is it like ice or rubber?
Which solver are you using, and how does it work? e.g., PGS vs. TGS in PhysX.
How many position and velocity iterations do you allow? Too few and objects will go through each other, like the ghosts in Harry Potter.
What will happen if two bodies collide really fast?
Your simulation is a mathematical approximation of reality. If you don’t understand that approximation, you don’t understand the environment your agent is living in.
You cannot help an agent navigate a world whose fundamental laws are a mystery to you.
Tip: Spending one day reading the documentation for your physics backend (like PhysX or MuJoCo) can save you months debugging an agent that “should” work but doesn’t.
2. You wouldn’t fly by thermometer: Master the MDP
If the simulator is the world, the Markov Decision Process (MDP) is how your agent experiences and interacts with it. A poorly designed MDP is like asking a pilot to fly a plane using only a thermometer and a PlayStation controller.
Observations
Start with the kitchen sink, then cull. Include every piece of relevant info you can think of (in the proper format), get a policy working, and then trim.
Normalise everything. Ensure joint positions and velocities are normalised to [-1, 1]. If you have continuous forces, pick a maximum and normalise to [0, 1]. If you are using RGB pixels, convert to float, subtract the channel mean and normalise, then convert back to uint8 for storage. If you are using depth, choose a maximum depth and normalise to [0, 1].
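Here is a minimal sketch of what I mean, assuming you know your joint limits and have picked a maximum velocity and depth (the names are illustrative, not from any particular library):

```python
import numpy as np

def normalise_obs(joint_pos, joint_vel, depth_img,
                  joint_limits, max_joint_vel, max_depth):
    """Scale each observation component into a fixed range before
    concatenating it into the policy input."""
    # Joint positions: map the [lower, upper] limits to [-1, 1].
    lower, upper = joint_limits
    joint_pos_n = 2.0 * (joint_pos - lower) / (upper - lower) - 1.0

    # Joint velocities: divide by a chosen maximum, then clip to [-1, 1].
    joint_vel_n = np.clip(joint_vel / max_joint_vel, -1.0, 1.0)

    # Depth: clip to a chosen maximum range and map to [0, 1].
    depth_n = np.clip(depth_img, 0.0, max_depth) / max_depth

    return np.concatenate([joint_pos_n, joint_vel_n, depth_n.ravel()])
```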
[Image: Normalised RGB (left, middle) and depth (right) observations.]
Careful with pixels. If you train with pixels, have a way to plot the observations from ALL environments simultaneously. I was having trouble training an agent to pick up an object via RGB observations. I decided to plot all 512 envs, and found that a subset of my envs looked drastically different (the fix was simple: extend the ground plane).
[Image: Make sure your pixel observations are consistent. 512 envs with default (left) and extended (right) ground plane.]
Always be relative. Your agent doesn’t need to know the block is at (3.4, -2.4, 1.6) meters from the world frame. It needs to know it is (0.05, 0, 0.01) meters to the left of its gripper. Providing relative positions (both as a vector and a Euclidean distance) drastically expedites learning.
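As a tiny illustration of what I mean (frame names are placeholders):

```python
import numpy as np

def relative_object_obs(object_pos_w, gripper_pos_w):
    """Express the object position relative to the gripper, not the world."""
    # Vector from the gripper to the object.
    rel_pos = object_pos_w - gripper_pos_w
    # The scalar Euclidean distance is a useful extra feature.
    distance = np.linalg.norm(rel_pos)
    return np.concatenate([rel_pos, [distance]])
```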
Rotation representations. Recent research suggests that neural networks can deal with rotation matrices better than quaternions [1]. Worth testing out yourself. And again, provide rotational information as relative to the robot or some goal.
Running scalers. It might feel weird to have your observations shifting, but many RL libraries provide “running standard scalers” that normalise each element of the observation to zero mean and unit variance. The scaler is then updated once per rollout. Definitely worth trying to see if it improves learning.
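If your library doesn’t provide one, the idea is roughly the sketch below (a batched running mean/variance update; treat it as illustrative rather than a drop-in replacement for any specific library’s scaler):

```python
import numpy as np

class RunningStandardScaler:
    """Tracks a running mean/variance per observation element and
    normalises to roughly zero mean, unit variance."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, batch):
        # Call once per rollout with a (num_samples, obs_dim) batch.
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        # Combine the old and new variances (parallel variance formula).
        m_old = self.var * self.count
        m_new = batch_var * batch_count
        self.var = (m_old + m_new + delta**2 * self.count * batch_count / total) / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```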
Observation history. The majority of robotics tasks are partially observable. If your observation isn’t Markovian and you aren’t giving your agent an observation history (flattened into a 1D vector or handled via a wrapper like LazyFrames), you are making life very hard for your agent.
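If you don’t have a wrapper handy, a fixed-length deque that gets flattened into the observation does the job (a minimal sketch):

```python
from collections import deque
import numpy as np

class ObsHistory:
    """Keeps the last `history_len` observations and returns them
    flattened into a single 1D vector."""
    def __init__(self, history_len=4):
        self.buffer = deque(maxlen=history_len)
        self.history_len = history_len

    def reset(self, obs):
        # Fill the whole history with the first observation of the episode.
        self.buffer.clear()
        for _ in range(self.history_len):
            self.buffer.append(obs)
        return np.concatenate(list(self.buffer))

    def step(self, obs):
        self.buffer.append(obs)
        return np.concatenate(list(self.buffer))
```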
Actions
Policy activation. For position-controlled robots we typically want the final policy activation to be tanh(), which we can then scale up to [min_joint_limit, max_joint_limit] (a sketch of this scaling appears after the Frequencies item below).
Only actuate what you need. If you’re training a humanoid to pick up a box, you likely do not need the head joints. You might think it’s harmless to include, like an extra observation, but actions are different. My personal experience and recent work [2] have shown that including extra DoFs in the actions can greatly degrade performance, making the problem too complex. Finally, only include the actuated joints in the observation.
Smoothing. It is harder than it would seem to get your robot to move nicely and smoothly. This is because dense rewards typically incentivise the agent to get somewhere as quickly as possible. From my understanding, there are three methods for enforcing smooth behaviour: lowering the joint velocity limits in the simulation, applying an Exponential Moving Average (EMA, equation below), or penalising via rewards. I have had mixed results with all three and am still looking for a robust solution that doesn’t require super careful tuning. Applying limits or penalties can easily kill exploration and thus performance, while the EMA makes the task less Markovian and therefore harder to solve.
\(a_t = \tau \cdot \pi(o_t) + (1-\tau)\cdot a_{t-1}\)
Frequencies. Your physics should run at >= 240 Hz for contact-rich tasks, but your RL control frequency should be lower. Experiment with decimation factors of 2, 4, or 8 to find the balance between reactive control and training speed.
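Putting the action items together, here is a sketch of how the tanh scaling, the EMA, and decimation fit into one step (the `env` attributes and methods are placeholders for whatever your stack provides):

```python
def apply_action(env, raw_action, tau=0.5, decimation=4):
    """Turn a tanh policy output into smoothed joint targets, then run
    several physics steps per policy step."""
    # 1. Affine map from [-1, 1] to [min_joint_limit, max_joint_limit].
    target = env.joint_lower + 0.5 * (raw_action + 1.0) * (env.joint_upper - env.joint_lower)

    # 2. Exponential moving average over successive targets (the equation above).
    env.smoothed_target = tau * target + (1.0 - tau) * env.smoothed_target

    # 3. One RL step = `decimation` physics steps with the same target,
    #    e.g. 240 Hz physics with decimation=4 gives 60 Hz control.
    for _ in range(decimation):
        env.set_joint_targets(env.smoothed_target)  # placeholder method
        env.physics_step()                          # placeholder method
```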
Rewards
Rewards are the hardest part of RL for me. It is very difficult to express complex behaviour with a single scalar. I have spent endless months just tuning reward scales and trying different combinations. This was useful for seeing for myself how brittle the algorithms are, but it is something I now actively avoid.
Simplicity is best. Yes, you got the robot to do something with 15 task rewards. But this result is unlikely to be reproduced by others, and probably won’t translate to other tasks. While cyclical “toy tasks” can benefit from sparse bonuses (e.g. bouncing a ball, Baoding ball rotation), the majority of robotics tasks require carefully thought out dense rewards.
Start by stretching. Can your robot even reach out and “stretch”? Before you ask it to juggle three balls, make sure it can move its arm toward a point. This was also a good way to test sim2real.
Distance functions. Use a hyperbolic tangent or exponential function to transform Euclidean distance into a (0, 1] reward. If the “tail” of your reward isn’t wide enough, the agent will never find the gradient (a minimal sketch follows at the end of this list).
Avoid penalties where possible. I started out by applying a -10 penalty for my Bounce and Baoding ball tasks whenever the robot hand dropped the ball. But I later realised that can actually impede learning. Instead, just let the agent learn that “end of episode = no more potential reward.”
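Referring back to the distance functions above, here is the kind of shaping I mean (the scale is the knob that controls how wide the tail is):

```python
import numpy as np

def distance_reward(distance, scale=0.1):
    """Map a Euclidean distance (metres) to a reward in (0, 1].

    A larger `scale` widens the tail so the agent still sees a gradient
    far from the target; a smaller `scale` sharpens the peak near it.
    """
    return 1.0 - np.tanh(distance / scale)
```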
Terminations & Resets
Truncation. To begin with, make the episode as short as possible for your task for faster learning. E.g. picking up an object does not take 10 seconds, it takes 3 seconds.
Terminations. Terminations are your friend! Use them liberally to avoid bad behaviour and speed up learning. If you have more than one termination condition (you should), make sure you track them individually on WandB to see which ones are being triggered.
Randomisations. At the start of each episode, apply small randomisations to your robot joints and the poses of your objects to avoid overfitting.
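A reset along those lines might look like this (the noise ranges are illustrative and task-dependent):

```python
import numpy as np

def randomised_reset(default_joint_pos, default_object_pos, rng):
    """Add small uniform noise at each episode reset so the agent does
    not overfit to a single initial state."""
    joint_pos = default_joint_pos + rng.uniform(-0.05, 0.05, size=default_joint_pos.shape)
    object_pos = default_object_pos + rng.uniform(-0.02, 0.02, size=default_object_pos.shape)
    return joint_pos, object_pos

# rng = np.random.default_rng(seed) gives you a reproducible generator.
```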
3. RL: Don't start from scratch, but make it yours
Which RL algorithm should you use? Start out with PPO. It’s the dominant approach for training robots (on-policy approaches thrive with GPU-parallelised environments). Then, if you’re interested in off-policy options like SAC, check out Antonin Raffin’s blogpost “Getting SAC to Work on a Massive Parallel Simulator”.
For PPO users, “The 37 Implementation Details of Proximal Policy Optimization” (ICLR 2022 blog track) is essential reading. I think we need a version of this for robotics! Until then, here is my advice for building your own robot learning stack:
Don’t code from scratch, but re-implement as much as possible. There is immense value in implementing things yourself, but building from a blank script is a recipe for months of debugging. Take a working example of a robot learning something from a library (e.g. roto), fork/private clone it, and then reimplement modules one by one.
Single-file implementation. Modularity often disperses crucial implementation details. Unless you are maintaining a public library, put your algorithm into a single file so you can see the entire logic at once, e.g. ppo.py.
Training/evaluation split. I see many robotics-RL papers that plot “training reward”, because this is what the library gives you. Instead, we really need to see the “average evaluation return”. With GPU-parallelised RL, I manually split my Isaac Lab farm into 100 environments that continuously evaluate (deterministically) while the rest handle training (imagine 100 environments with 10 of them set aside as evaluation envs).
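In code, the split can be as simple as reserving a fixed slice of env indices for deterministic evaluation (the counts and the policy interface below are illustrative):

```python
import torch

num_envs = 512        # illustrative
num_eval_envs = 100
# The first `num_eval_envs` indices are held out for evaluation.
eval_ids = torch.arange(num_eval_envs)
train_ids = torch.arange(num_eval_envs, num_envs)

@torch.no_grad()
def select_actions(policy, obs):
    """Stochastic actions for the training envs, deterministic (mean)
    actions for the evaluation envs. Assumes `policy(obs)` returns a
    torch.distributions.Distribution over actions."""
    dist = policy(obs)
    actions = dist.sample()
    actions[eval_ids] = dist.mean[eval_ids]
    return actions

# Transitions from `eval_ids` are excluded from the PPO update, and their
# episode returns are logged separately (e.g. as "eval/return_mean").
```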
Start with ground-truth observations. RL agents thrive on ground-truth states because they are continuous and relate directly to the task. If you have a new task idea, start with these (e.g., object pose relative to end-effector, object velocity) to sound-board the concept. Once it works, transition to “real-world” observations like pixels or tactile data. BUT know that…
…RL agents will struggle to integrate multimodal observations. e.g. if you have proprioception, pixels, and tactile, it is likely the pixels will completely dominate the learning. For example, here is an agent with proprioceptive, tactile, and RGB observations tasked with finding and lifting up an object inside a box. Instead of using the tactile observations, this agent learned to knock the box over to find the object. I recommend using self-supervised techniques to handle the representation learning from multimodal observations (see my repo multimodal_rl for how to do this).
Coupled vs. decoupled policy and value function. Evidence is conflicting on whether to share weights between policy and value functions. I chose the middle ground: a shared initial trunk with separate multi-layer “heads” for the policy and value functions. This approach also lets me do self-supervised learning on this shared trunk/encoder. Good to explore different architectures yourself.
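The middle ground looks roughly like this (layer sizes and activations are illustrative):

```python
import torch
import torch.nn as nn

class SharedTrunkActorCritic(nn.Module):
    """A shared encoder ('trunk') feeding separate policy and value heads.
    Self-supervised losses can also be applied to the trunk."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.policy_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )
        self.value_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        z = self.trunk(obs)
        return self.policy_head(z), self.value_head(z)
```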
4. Hyperparameters are your friend
Whenever I hear “maybe RL doesn’t work for this task”, my brain sees this:
This was a hyperparameter sweep I performed as a last-ditch effort for a basic task I had been struggling with for weeks. The problem wasn’t the physics, MDP, or RL algorithm. It was the hyperparameters! I had avoided tuning because it felt like “too much effort” compared to tinkering with reward functions - a weird cognitive dissonance that almost cost me the task. Thus, remind yourself often that hyperparameters are your friend, not foe.
Steps to become a hyperparameter superstar:
Study up. Read and follow the guidelines in “Hyperparameters in Reinforcement Learning and How To Tune Them” (Eimer et al. ICML 2023). They recommend optimising over all 11 PPO hyperparameters, but if you are new to this just choose a subset to build confidence. Something is better than nothing.
Automate everything. Do not attempt to tune manually; sadly, grid search is not going to help you much for robotics environments. Use a dedicated hyperparameter optimisation library; I like Optuna. Check out Antonin Raffin’s blog posts for a nice introduction. I have an example of how to use Optuna with Isaac Lab here, plus a minimal sketch at the end of this list.
Start early. Perform hyperparameter optimisation AND multi-seed runs as soon as possible. Do not wait for the final paper write-up to see if your results hold.
Mean matters, not maximum. A great RL contribution isn’t just one that improves the mean evaluation return; it’s one that improves the mean of the mean evaluation returns across 5+ seeds. You want a policy that is robust, not just one that got lucky once.
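As promised above, a minimal Optuna loop looks roughly like this (the search ranges are illustrative, and `train_and_evaluate` is a placeholder for your own training code that returns the mean evaluation return):

```python
import optuna

def objective(trial):
    # Sample a small subset of PPO hyperparameters.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "entropy_coef": trial.suggest_float("entropy_coef", 1e-4, 1e-2, log=True),
        "num_epochs": trial.suggest_int("num_epochs", 3, 10),
        "clip_range": trial.suggest_float("clip_range", 0.1, 0.3),
    }
    # Placeholder: train a policy with `params` and return its mean
    # evaluation return (ideally averaged over a few seeds).
    return train_and_evaluate(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```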
5. WandB: Mission Control
Your WandB dashboard should look like a NASA mission control center. It is a high-fidelity overview of every heartbeat in your system. Plot everything you can think of.
The full spectrum of evaluation. Don’t just plot the mean. Plot the mean, median, max, and min evaluation returns. You will be surprised how different the mean can look from the median, and the two together completely hide the min and max performance, which are important to keep an eye on. If the min is stuck at zero (like below), maybe the randomisations are too extreme. (A minimal logging sketch for these plots follows at the end of this list.)
The return breakdown. This is your primary debugging tool. Plot the individual contributions of every reward component. If the mean returns are stuck, the return breakdown will tell you if the agent is over-optimising for an energy bonus while ignoring the task goal.
Physical metrics. End-effector Euclidean distance to target, joint velocities, tactile forces, object positions and velocities, etc.
RL training: value loss, policy loss, policy standard deviation, value scaler mean, value scaler standard deviation, and the learning rate schedule.
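As mentioned above, here is a minimal sketch of the evaluation logging (the metric names are just how I like to organise them):

```python
import numpy as np
import wandb

def log_evaluation(eval_returns, reward_terms, step):
    """Log the full spread of evaluation returns plus the per-term
    reward breakdown.

    eval_returns: array of episode returns from the evaluation envs.
    reward_terms: dict mapping reward component name -> array of per-episode sums.
    """
    metrics = {
        "eval/return_mean": float(np.mean(eval_returns)),
        "eval/return_median": float(np.median(eval_returns)),
        "eval/return_min": float(np.min(eval_returns)),
        "eval/return_max": float(np.max(eval_returns)),
    }
    for name, values in reward_terms.items():
        metrics[f"rewards/{name}"] = float(np.mean(values))
    wandb.log(metrics, step=step)
```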
6. Play the long game
I started piano when I was five, but I was likely eight before I played anything my parents actually enjoyed listening to. For those first three years, it was mostly dreary scales and missed notes. Now, I can pick up a piece and just play.
I’ve found reinforcement learning to be a remarkably similar experience (albeit on a shorter timescale, thankfully). In the beginning, my agents were like my five-year-old fingers. It seemed they had a will of their own, flailing hopelessly around the simulator. Now, we work in harmony. My first "piece" took months (the Find environment in roto); now, I can iterate on new tasks in a day.
Remember that with RL, you are playing the long game. There will be months of learning, debugging, and head-scratching with almost nothing to show for it. It’s a quiet, invisible kind of progress that requires trust in the process.
Everyone’s timeline is unique, but for perspective: my first project required nine months of learning the ropes (RL, Isaac Lab, etc.) to figure out what I even wanted to do, and then another nine months to do it and write up a paper. To give you an idea of the scale, here is the WandB project: it shows nearly 7,000 runs of trial & error.
If you are stuck, don’t suffer in silence. Although you may not realise it, you are part of a wonderful and well-connected research community. Here are some things you can try:
Search the internet or ask our new AI friends.
Talk to other RL researchers inside your university or company.
Create an issue on the GitHub repository you are working on. For example, the Isaac Lab team always responded to me quickly and helped me.
Reach out to the global community. Email the authors from your favorite papers, message people on LinkedIn, etc.
When I was really struggling to make a humanoid robot learn my task on a sweltering summer afternoon in Tokyo, my collaborator Tamon looked at me and said,
“Ganbatte Elle!” (頑張って)
It’s a word that simultaneously translates to “do your best,” “hang in there,” and “keep at it!” Ultimately, that is my final piece of advice for anyone attempting RL for robotics. The game is long, but seeing your agent finally master a complex physical task is incredibly rewarding. Ganbatte, my friend!
Citation
@article{miller2026theartofrobotrl,
title = "The art of robot reinforcement learning",
author = "Miller, Elle",
journal = "parallelles.substack.com",
year = "2026",
url = "https://parallelles.substack.com/p/the-art-of-robot-reinforcement-learning"
}
References
[1] Geist et al., “Learning with 3D Rotations: A Hitchhiker’s Guide to SO(3)” (ICML 2024). Codebase: https://github.com/martius-lab/hitchhiking-rotations
[2] Sferrazza et al., “HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation” (2024). Project page: https://humanoid-bench.github.io/
[3] Eimer et al., “Hyperparameters in Reinforcement Learning and How To Tune Them” (ICML 2023)
Blog posts
Antonin Raffin, “Getting SAC to Work on a Massive Parallel Simulator”
“The 37 Implementation Details of Proximal Policy Optimization” (ICLR 2022 blog track)
Credits reel: here are some fun agent videos in the process of creating roto - warning that some contain music! 🎶🎶🎶