Planning with the Views via Scene Self-Exploration
Abstract Overview
This paper studies view planning: the ability to predict how viewpoint-changing actions alter an observed scene and to compose those changes over multiple steps to reach or localize a target view. To evaluate this capability, the authors introduce ViewSuite, a benchmark built on real ScanNet scenes with 6-DoF viewpoint control and three tasks: Path-to-View, View-to-Path, and Interactive View Planning. Experiments on 13 frontier vision-language models show a clear gap between local view-action understanding and multi-turn planning: models perform reasonably on the single-turn tasks but degrade sharply on interactive planning, especially as viewpoint distance increases. To address this, the paper proposes an iterative training framework that alternates self-exploration in 3D scenes with view graph distillation, converting explored trajectories into supervised training signals.
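To make "composing viewpoint changes over multiple steps" concrete, the following is a minimal sketch, not ViewSuite's actual action space. It assumes a camera pose stored as a 4x4 homogeneous matrix, actions expressed in the camera's local frame, local +x as the forward direction, and z as the up axis. It covers only translation and yaw rather than the full 6 degrees of freedom, and all function names are hypothetical.

```python
import math

def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matmul(a, b):
    # Plain 4x4 matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translate(dx, dy, dz):
    # Translation action in the camera's local frame (assumed convention).
    m = identity()
    m[0][3], m[1][3], m[2][3] = dx, dy, dz
    return m

def yaw(deg):
    # Rotation about the z (up) axis; positive angles turn left (assumed).
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    m = identity()
    m[0][0], m[0][1] = c, -s
    m[1][0], m[1][1] = s, c
    return m

def apply_actions(pose, actions):
    # Local-frame actions compose by right-multiplication, in order.
    for action in actions:
        pose = matmul(pose, action)
    return pose

def position(pose):
    # Camera position, rounded to suppress floating-point noise.
    return tuple(round(pose[i][3], 6) for i in range(3))

forward = translate(1.0, 0.0, 0.0)
turn_left = yaw(90)

# The same two actions in a different order end at different positions.
pos_a = position(apply_actions(identity(), [forward, turn_left]))
pos_b = position(apply_actions(identity(), [turn_left, forward]))
```

Because the two orderings end at different positions, a planner cannot treat actions as an unordered set. This order dependence is one reason multi-step planning is harder than predicting the effect of a single action.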
Novelty
The work appears novel in both problem formulation and method. It introduces ViewSuite as a benchmark for multi-turn view planning in real 3D scenes with full 6-DoF control, and proposes view graph distillation, which turns even failed exploration trajectories into reusable supervision by organizing them as a graph of connected viewpoints.
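The idea of reusing failed trajectories can be illustrated with a small sketch. This summary does not say how the authors build the graph or derive training targets, so everything below is an assumption: views are hypothetical string IDs, each trajectory is a list of (source view, action, destination view) steps, and supervision is taken to be the shortest action sequence between any two connected views.

```python
from collections import defaultdict, deque

def build_view_graph(trajectories):
    # Merge all explored steps into one directed graph: view -> {next_view: action}.
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for src, action, dst in trajectory:
            graph[src][dst] = action
    return graph

def shortest_action_path(graph, start, goal):
    # Breadth-first search; returns the action list, or None if unreachable.
    if start == goal:
        return []
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt, action in graph.get(node, {}).items():
            if nxt in parent:
                continue
            parent[nxt] = (node, action)
            if nxt == goal:
                path, cur = [], goal
                while parent[cur] is not None:
                    prev, act = parent[cur]
                    path.append(act)
                    cur = prev
                return path[::-1]
            queue.append(nxt)
    return None

def distill(trajectories):
    # Emit one (start, goal, actions) example per reachable ordered pair of views.
    graph = build_view_graph(trajectories)
    nodes = set(graph) | {dst for nbrs in graph.values() for dst in nbrs}
    examples = []
    for start in sorted(nodes):
        for goal in sorted(nodes):
            if start == goal:
                continue
            path = shortest_action_path(graph, start, goal)
            if path is not None:
                examples.append((start, goal, path))
    return examples

# Two hypothetical explorations, neither of which reached its intended target.
failed_runs = [
    [("A", "forward", "B"), ("B", "turn_left", "C")],
    [("B", "look_up", "D")],
]
examples = distill(failed_runs)
stitched = shortest_action_path(build_view_graph(failed_runs), "A", "D")
```

Here the path from A to D is stitched from two separate runs, so the graph yields a valid training example that neither trajectory contained on its own. A real system would also need a rule for deciding when two nearby camera poses count as the same view, which this sketch sidesteps by using exact IDs.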
Results
Across 13 frontier VLMs, the authors find that strong single-turn performance on Path-to-View and View-to-Path does not translate into strong Interactive View Planning, where the best frontier model reaches 21.4%. Their iterative self-exploration and view graph distillation framework improves Qwen2.5-VL-7B-Instruct from 2.5% to 47.8% on Interactive View Planning, exceeding the reported results for GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). The trained model also shows evidence of transferable spatial priors: when both models receive identical post-training, it outperforms its base model on related ViewSuite tasks and on the external MindCube benchmark.
Key Points
- ViewSuite benchmarks view planning in real ScanNet-based 3D scenes through three complementary tasks that separate single-step view-transition understanding from multi-turn planning.
- Frontier VLMs show a planning gap: they achieve much stronger results on single-turn view reasoning than on Interactive View Planning, with performance worsening as viewpoint distance grows.
- The proposed iterative framework uses self-exploration plus view graph distillation to extract supervision from all trajectories, including failures, leading to large gains on Interactive View Planning and improved transfer to related spatial reasoning tasks.