Daily Paper

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning.

arXiv:2604.19092 Empirical Study
machine-learning

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

1. Introduction: The Illusion of Visual Perfection

The recent explosion of large-scale video world models—such as OpenAI’s Sora, Google’s Veo, and Alibaba’s Wan—has redefined what is possible in digital synthesis. These models generate “visually plausible” futures where objects move smoothly and lighting remains consistent. However, for the field of robotics, visual beauty is a secondary concern. The primary requirement is “physical executability”: the ability to translate a predicted video into a series of real-world motor commands that successfully complete a task.

The research context identifies a critical “Reality Gap.” This is the discrepancy between a video looking realistic to a human observer and being physically consistent enough for a robot to follow. We are currently witnessing a rise in “hallucinatory physics,” where generated videos show a hand picking up a cup, but if the fingers don’t actually apply pressure or if the cup’s geometry warps during the lift, a robot trying to replicate those frames will fail. To bridge this gap, researchers have moved beyond simply looking at videos to actually “executing” them in simulated environments.

2. Introducing RoboWM-Bench: The New Standard for Physical AI

Traditional benchmarks for world models focus on perceptual metrics—visual quality, semantic consistency, and temporal coherence. While useful for entertainment, these metrics do not measure whether a model understands the underlying laws of manipulation.

RoboWM-Bench was created as a manipulation-centric benchmark for embodiment-grounded evaluation. Instead of asking “Does this look real?”, it asks “Can a robot actually do this?” By converting generated videos into action sequences and testing them in physically grounded simulations, the benchmark provides a standardized, reproducible way to measure the physical intelligence of AI. It moves the goalposts from artistic synthesis to functional utility.

![SOURCE_IMAGE_1]

3. The Engine Room: How RoboWM-Bench Validates “Executability”

To turn pixels into physical movement, RoboWM-Bench utilizes a dual-path evaluation pipeline that bridges the gap between human demonstrations and robotic control:

  • Human-Centric Retargeting: Current world models are often better at generating human hands because they were trained on vast amounts of internet video. RoboWM-Bench uses HaMeR (a 3D hand pose estimation tool) to track 21 keypoints on the human hand. It then maps these points to a robot’s end-effector, translating human finger movements into position, orientation, and gripper commands.
  • Robot-Centric Inverse Dynamics (IDM): For robotic videos, the system uses an Inverse Dynamics Model (IDM) to extract actions. This model is trained in two stages:
    1. Simulation Pretraining: The model learns motion on large-scale simulation data. Crucially, background masking is used here to focus the model entirely on the arm’s movement, stripping away visual noise.
    2. Real-World Adaptation: The model is then fine-tuned on real trajectories without background masking, allowing it to adapt to complex, raw visual environments while retaining the motion priors learned in sim.

A critical component of this process is the “Real-to-Sim” framework. The benchmark uses 4D Gaussian representations to reconstruct real-world scenes in a simulator. Unlike traditional meshes, 4D Gaussians preserve high visual realism and spatial consistency, ensuring that the simulation itself isn’t the source of the “Reality Gap,” but rather a faithful mirror for testing the model’s predictions.

![SOURCE_IMAGE_24]

4. The Reality Check: Perceptual Plausibility vs. Physical Truth

One of the most striking findings of the RoboWM-Bench research is that perceptual scores (like those from PAI-Bench) are often “saturated,” while execution scores remain low.

Consider the “Banana Grasp” scenario highlighted in the research (Figure 4). In a generated video, a human hand may appear to securely hold a banana between the thumb and index finger. Visually, it looks plausible—even perfect. However, when these frames are converted into robot actions, the “grasp” fails. The simulation reveals that the model did not accurately predict the necessary contact forces or precise spatial alignment. The banana slips because the “visual grip” was an illusion. This highlights why embodiment-aware evaluation is non-negotiable; without it, we are essentially training robots to hallucinate success.

![SOURCE_IMAGE_2] ![SOURCE_IMAGE_3]

5. Performance Deep-Dive: Which Models Lead the Pack?

The results reveal a massive “Human-Robot Gap.” Human-hand videos consistently achieved higher success rates than robotic ones. This points to a significant data bottleneck: we are currently teaching robots using a medium (Internet video) that is heavily biased toward human anatomy.

Below is a comparison of task-level success rates derived from the benchmark’s findings:

ModelHuman Tasks (Success %)Robot Tasks (Success %)
Wan 2.6~77%~23%
Veo 3.1~53%~10%
Cosmos~14%~5%
LVP~48%N/A*
Cosmos-FinetuneN/A~48%

Note: LVP was not tested on robot tasks as it is an interaction-oriented model specifically trained on human data. Note that for several human tasks like “Pour Water” and “Fold Towel,” Cosmos actually scored 0%, highlighting a significant “floor” in its physical reasoning for complex interactions.

6. Why Do They Fail? Identifying the Common Culprits

The analysis identified four recurring failure modes that expose the limits of current world models:

  1. Inaccurate Contact Prediction: Models often show a gripper “touching” an object but fail to model the stable grasp required to lift it. The object might “float” in the video, but physics-based execution causes it to fall.
  2. Geometric Distortions: Robotic arms in generated videos often suffer from non-physical warping. While subtle to the eye, these distortions make it impossible for an IDM to calculate a consistent joint path.
  3. Deformable Object Complexity: Tasks like “Fold Towel” proved to be the most difficult due to the complexity of contact-sensitive deformable manipulation. Modeling how cloth reacts to touch remains a frontier challenge.
  4. Bimanual Coordination: Models struggle significantly with bimanual tasks (e.g., collaborative towel folding or object handover). Synchronizing the interaction of two hands introduces a layer of coordination constraints that current world models have yet to master.

![SOURCE_IMAGE_11] ![SOURCE_IMAGE_12] ![SOURCE_IMAGE_13]

7. Conclusion: The Road Ahead for Physically Grounded AI

The introduction of RoboWM-Bench marks a shift in how we evaluate AI. It is no longer enough for a world model to be a good “artist”; it must also be a competent “physicist.”

The research leads to three critical conclusions:

  • Visual realism is a poor proxy for physical executability.
  • A major “Human-Robot Gap” exists due to training data bottlenecks.
  • Targeted fine-tuning (as seen with Cosmos-Finetune, which jumped to 48% success) is currently the most effective way to ensure a robot can actually follow a world model’s plan.

As we look toward the next generation of robotics, benchmarks that prioritize execution over appearance will be the primary guides for developing AI that can truly interact with, and not just imagine, the physical world.

Takeaway Summary

  • The Reality Gap: Videos that look real often fail the laws of physics when translated into robot actions.
  • RoboWM-Bench: A new tool that validates AI “imaginations” by converting them into actual robot actions in high-fidelity 4D Gaussian simulations.
  • The Bimanual Hurdle: Beyond single-arm tasks, synchronizing two hands remains a massive challenge for current world models.
  • The Power of Fine-Tuning: While general models like Wan 2.6 lead, manipulation-specific fine-tuning is the current key to bridging the execution gap.

Read the full paper on arXiv · PDF