WorldEval: World Model as Real-World Robot Policies Evaluator

Yaxuan Li*^1,2, Yichen Zhu*^1†, Junjie Wen, Chaomin Shen, Yi Xu

1. Midea Group 2. East China Normal University

^*Indicates equal contribution. This work was done during Yaxuan Li's internship at Midea Group.
^†Corresponding author.

Abstract

The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot's actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach that turns a video generation model into a world simulator that follows latent actions to generate robot videos. We then introduce WorldEval, an automated pipeline designed to evaluate real-world robot policies entirely online. WorldEval effectively ranks different robot policies as well as individual checkpoints within a single policy, and it functions as a safety detector to prevent dangerous actions by newly developed robot models. Through comprehensive paired evaluations of manipulation policies in real-world environments, we demonstrate a strong correlation between policy performance in WorldEval and in real-world scenarios. Furthermore, our method significantly outperforms popular methods such as the real-to-sim approach.

BibTeX

@misc{li2025worldevalworldmodelrealworld,
  title={WorldEval: World Model as Real-World Robot Policies Evaluator},
  author={Yaxuan Li and Yichen Zhu and Junjie Wen and Chaomin Shen and Yi Xu},
  year={2025},
  eprint={2505.19017},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.19017},
}

Visualization of Real-World Robot Policy and Generated Video Policy

1. Success Cases

2. Failure Cases

3. Cases with different data collection frequencies

Experimental Results

Real vs. WorldEval success rates.
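
As a minimal sketch of how paired success rates like these could be compared, the snippet below computes a Pearson correlation and a Spearman rank correlation between real-world and world-model success rates. The policy names and all numeric values are hypothetical placeholders chosen only for illustration; they are not results from the paper, and the paper's exact metric may differ.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical placeholder success rates (NOT taken from the paper).
real_success = {"policy_a": 0.80, "policy_b": 0.55, "policy_c": 0.30}
world_success = {"policy_a": 0.75, "policy_b": 0.60, "policy_c": 0.25}

names = sorted(real_success)
real = [real_success[name] for name in names]
world = [world_success[name] for name in names]

print("Pearson:", pearson(real, world))
print("Spearman:", spearman(real, world))
```

The Pearson value reflects how closely the two sets of success rates track each other linearly, while the Spearman value only checks whether the evaluator orders the policies the same way as real-world trials, which is the property that matters when the goal is ranking policies or checkpoints.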

WorldEval vs. Real-to-Sim evaluation.

Policy2Vec versus other encoding methods.
