WorldEval

World Model as Real-World Robot Policies Evaluator

Yaxuan Li*1,2 Yichen Zhu*1 Junjie Wen Chaomin Shen Yi Xu
1. Midea Group 2. East China Normal University

*Indicates equal contribution. This work was done during Yaxuan Li's internship in Midea Group.
Corresponding author.

Abstract

The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot's actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach that turns a video generation model into a world simulator that follows latent actions to generate the robot video. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks various robot policies and individual checkpoints within a single policy, and functions as a safety detector to prevent dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and in real-world scenarios. Furthermore, our method significantly outperforms popular methods such as the real-to-sim approach.

Visualization of Real-World Robot Policy and Generated Video Policy

1. Success Cases

Each instruction below is shown as a pair of videos at 1x speed: real world and generated video.

Instruction: Clean the table.
Instruction: Clean the table.
Instruction: Collect the toy to the right tray. (not in train data)
Instruction: Place the empty blue cup to the cup mat.
Instruction: Pass the red block to the right arm to place it on the blue mat.
Instruction: Clean the table.
Instruction: Clean the table.
Instruction: Clean the table.

2. Failure Cases

Each instruction below is shown as a pair of videos at 1x speed: real world and generated video.

Instruction: Clean the table.
Instruction: Clean the table.
Instruction: Collect the toy to the right tray. (not in train data)
Instruction: Place the empty blue cup to the cup mat.
Instruction: Pass the red block to the right arm to place it on the blue mat.
Instruction: Pick up the hammer, then strike the red block.
Instruction: Clean the table.
Instruction: Clean the table.

3. Cases with Different Data Collection Frequencies

Each instruction below is shown at 1x speed with real-world data collected at 10 Hz and at 50 Hz, each paired with its generated video.

Instruction: Pass the red block to the right arm to place it on the blue mat. (real world collected at 10 Hz)
Instruction: Pass the red block to the right arm to place it on the blue mat. (real world collected at 50 Hz)
Instruction: Place the empty blue cup to the cup mat. (real world collected at 10 Hz)
Instruction: Place the empty blue cup to the cup mat. (real world collected at 50 Hz)

Experimental Results

Real vs. WorldEval success rates.
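
As an illustration of how agreement between real-world and WorldEval success rates could be quantified, the sketch below computes a Pearson correlation coefficient and checks whether both evaluations rank the policies in the same order. This is an assumption about one reasonable way to measure such agreement, not necessarily the exact metric used in the paper. The policy names and success rates are hypothetical placeholders, not reported results, and the helper names `pearson` and `ranking` are introduced here for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def ranking(success_rates):
    """Policy names ordered from highest to lowest success rate."""
    return sorted(success_rates, key=success_rates.get, reverse=True)


# Hypothetical placeholder success rates, NOT results from the paper.
real = {"policy_a": 0.80, "policy_b": 0.55, "policy_c": 0.30}
world_eval = {"policy_a": 0.75, "policy_b": 0.60, "policy_c": 0.25}

names = sorted(real)  # fixed order so both lists line up per policy
r = pearson([real[n] for n in names], [world_eval[n] for n in names])
same_order = ranking(real) == ranking(world_eval)
print(f"Pearson r = {r:.3f}, same ranking: {same_order}")
```

A value of r close to 1 together with an identical ranking would indicate that the proxy evaluation tracks real-world performance. With only a handful of policies the coefficient is a coarse summary, so a rank-based check is a useful complement.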

WorldEval vs. Real-to-Sim evaluation.

Policy2Vec versus other encoding methods.

BibTeX

@misc{li2025worldevalworldmodelrealworld,
      title={WorldEval: World Model as Real-World Robot Policies Evaluator}, 
      author={Yaxuan Li and Yichen Zhu and Junjie Wen and Chaomin Shen and Yi Xu},
      year={2025},
      eprint={2505.19017},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.19017}, 
}