WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
Abstract Overview
WBench is a benchmark for evaluating interactive video world models in multi-turn settings, rather than judging single-shot video quality alone. The benchmark organizes evaluation along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance. It contains 289 test cases and 1,058 interaction turns spanning diverse scenes, styles, subjects, first- and third-person perspectives, and four interaction types: navigation, subject action, event editing, and perspective switching. The evaluation suite uses 22 automatic sub-metrics, and the paper reports experiments on 20 state-of-the-art models under a unified protocol.
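To make the benchmark's shape concrete, the following is a minimal sketch of how one multi-turn test case could be represented. The class and field names (`BenchCase`, `Turn`, `setting_prompt`, and so on) are assumptions made for illustration and are not taken from the paper or its released data format; only the four interaction types, the two perspectives, and the case and turn counts come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class InteractionType(Enum):
    # The four interaction types named in the overview.
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class Perspective(Enum):
    FIRST_PERSON = "first_person"
    THIRD_PERSON = "third_person"


@dataclass
class Turn:
    # One interaction turn; the field names are hypothetical.
    interaction: InteractionType
    instruction: str


@dataclass
class BenchCase:
    # One multi-turn test case; the schema is an assumption, not the paper's.
    case_id: str
    setting_prompt: str
    perspective: Perspective
    turns: list[Turn] = field(default_factory=list)


def mean_turns_per_case(total_turns: int, total_cases: int) -> float:
    """Average number of interaction turns per test case."""
    return total_turns / total_cases


# Invented example case, for illustration only.
example = BenchCase(
    case_id="demo-001",
    setting_prompt="A rainy street at night",
    perspective=Perspective.FIRST_PERSON,
    turns=[
        Turn(InteractionType.NAVIGATION, "walk forward"),
        Turn(InteractionType.EVENT_EDITING, "the rain stops"),
    ],
)

# Counts reported in the overview: 1,058 turns over 289 cases.
print(round(mean_turns_per_case(1058, 289), 2))  # 3.66
```

With the reported counts, a case averages roughly 3.7 turns, which is what distinguishes this setting from single-shot evaluation.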
Novelty
The paper's main novelty is a unified open-domain benchmark that jointly covers both first- and third-person perspectives, four interaction types, and five evaluation dimensions for interactive world models. It also introduces a unified navigation interface across text, 6-DoF camera-pose, and discrete-action control, together with an automatic 22-metric evaluation pipeline validated against human judgments.
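One way to picture a unified navigation interface is a single canonical command that is rendered into whichever control modality a model accepts. The sketch below is an illustration under stated assumptions, not the paper's implementation: the action names, the step size `STEP_M`, the turn angle `TURN_DEG`, the axis and sign conventions, and the `render_command` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoseDelta:
    # Relative 6-DoF camera motion: translation plus rotation.
    # Units (meters, degrees) and axis conventions are assumed here.
    tx: float = 0.0
    ty: float = 0.0
    tz: float = 0.0
    roll: float = 0.0
    pitch: float = 0.0
    yaw: float = 0.0


STEP_M = 0.5     # assumed translation per step
TURN_DEG = 15.0  # assumed rotation per step

# Each canonical action maps to a pose delta and a text instruction.
ACTIONS = {
    "forward": (PoseDelta(tz=STEP_M), "move forward"),
    "backward": (PoseDelta(tz=-STEP_M), "move backward"),
    "left": (PoseDelta(tx=-STEP_M), "move left"),
    "right": (PoseDelta(tx=STEP_M), "move right"),
    "turn_left": (PoseDelta(yaw=-TURN_DEG), "turn left"),
    "turn_right": (PoseDelta(yaw=TURN_DEG), "turn right"),
}


def render_command(action: str, control: str):
    """Render one canonical action for a given control modality.

    control is "text", "pose" (6-DoF delta), or "action" (discrete token).
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    delta, text = ACTIONS[action]
    if control == "text":
        return text
    if control == "pose":
        return delta
    if control == "action":
        return action
    raise ValueError(f"unknown control modality: {control!r}")


print(render_command("turn_right", "text"))  # turn right
print(render_command("turn_right", "pose"))
```

The point of such a design is that the same test turn can drive a text-conditioned model, a camera-pose-conditioned model, and a discrete-action model, so their navigation scores stay comparable.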
Results
Across the 20 evaluated models, the study finds that no single model performs strongly on all five dimensions. Models with native camera or action control perform better on navigation, while text-driven models generally lead on setting adherence and physics-related scores. The analysis also shows that navigation is relatively decoupled from other capabilities, that perspective switching is especially difficult, and that automated metric rankings align closely with human preference.
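Agreement between an automated ranking and human preference is commonly summarized with a rank correlation. The summary does not say which statistic the paper uses, so the sketch below assumes Spearman's coefficient as one reasonable choice; the scores in the usage lines are invented for illustration and are not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Invented per-model scores: automated metric vs. human preference.
metric_scores = [0.62, 0.71, 0.55, 0.80]
human_scores = [3.1, 3.8, 3.3, 4.5]
rho = spearman(metric_scores, human_scores)
```

A value near 1 means the automated metrics order the models almost the same way human raters do; a value near 0 means the two orderings are unrelated.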
Key Points
- WBench covers 289 cases and 1,058 turns with multi-turn interactions across navigation, subject action, event editing, and perspective switching.
- The benchmark evaluates models with 22 automatic sub-metrics spanning video quality, setting adherence, interaction adherence, consistency, and physics compliance.
- Experiments on 20 models show no overall winner and reveal distinct trade-offs between controllability, consistency, setting adherence, and physical plausibility.
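The "no overall winner" observation can be checked mechanically once each model has one score per dimension. The following is a minimal sketch, assuming higher scores are better and that sub-metrics have already been aggregated into one value per dimension; the model names and numbers are invented and do not come from the paper.

```python
# The five evaluation dimensions named in the benchmark description.
DIMENSIONS = (
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physics_compliance",
)


def leader_per_dimension(scores):
    """Map each dimension to the model with the highest score on it.

    Ties go to whichever model appears first in the dictionary.
    """
    return {d: max(scores, key=lambda m: scores[m][d]) for d in DIMENSIONS}


def overall_winner(scores):
    """Return the model that leads every dimension, or None if none does."""
    leaders = set(leader_per_dimension(scores).values())
    return leaders.pop() if len(leaders) == 1 else None


# Invented scores for two hypothetical models (higher is better).
toy_scores = {
    "camera_controlled_model": {
        "video_quality": 0.70,
        "setting_adherence": 0.60,
        "interaction_adherence": 0.85,
        "consistency": 0.75,
        "physics_compliance": 0.55,
    },
    "text_driven_model": {
        "video_quality": 0.78,
        "setting_adherence": 0.82,
        "interaction_adherence": 0.60,
        "consistency": 0.70,
        "physics_compliance": 0.72,
    },
}

print(overall_winner(toy_scores))  # None
```

In this toy setup the camera-controlled model leads on interaction adherence and consistency while the text-driven model leads elsewhere, so no single model tops every dimension.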
References
- arXiv: https://arxiv.org/abs/2605.25874v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.25874v1
- Hugging Face Papers: https://huggingface.co/papers/2605.25874