
Planning with the Views via Scene Self-Exploration

Authors: Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li
Affiliations: Stanford University / Northwestern University / Microsoft / University of Washington / University of Oxford
Categories: Method / View Planning / Planning with functional view transformations; Application / 3D Scene Understanding / ScanNet point cloud environment; Evaluation / Robotic Exploration / Scene self-exploration in 3D
License: CC BY 4.0

Abstract Overview

This paper studies view planning: the ability to predict how viewpoint-changing actions alter an observed scene and to compose those changes over multiple steps to reach or localize a target view. To evaluate this capability, the authors introduce ViewSuite, a benchmark built on real ScanNet scenes with 6-DoF viewpoint control and three tasks: Path-to-View, View-to-Path, and Interactive View Planning. Experiments on 13 frontier vision-language models (VLMs) show a clear gap between local view-action understanding and multi-turn planning: models perform reasonably on the two single-turn tasks (Path-to-View and View-to-Path) but degrade sharply on Interactive View Planning, especially as the distance between the start and target viewpoints increases. To address this, the paper proposes an iterative training framework that alternates self-exploration in 3D scenes with view graph distillation, which converts explored trajectories into supervised training signals.
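The idea of composing viewpoint-changing actions can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not ViewSuite's actual interface: the pose representation (a 3x3 rotation plus a position), the action encoding (`"move"` and `"rotate"` tuples), the axis conventions, and the helper names `apply_action` and `rollout` are all hypothetical.

```python
import math

def axis_rotation(axis, degrees):
    """Rotation matrix about one camera axis (axis naming is an assumed convention)."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    if axis == "x":  # pitch
        return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    if axis == "y":  # yaw
        return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    if axis == "z":  # roll
        return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    raise ValueError(f"unknown axis: {axis}")

def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def apply_action(pose, action):
    """Apply one hypothetical action, expressed in the current camera frame."""
    rotation, position = pose
    kind, value = action
    if kind == "move":
        # Translate along the camera's own axes, so heading matters.
        step = matvec(rotation, value)
        return rotation, [p + d for p, d in zip(position, step)]
    if kind == "rotate":
        axis, degrees = value
        return matmul(rotation, axis_rotation(axis, degrees)), position
    raise ValueError(f"unknown action kind: {kind}")

def rollout(pose, actions):
    """Compose a sequence of actions into a final pose."""
    for action in actions:
        pose = apply_action(pose, action)
    return pose

# Identity orientation at the origin.
START = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
```

Because translations are applied in the camera frame, the same multiset of actions in a different order generally ends at a different pose, which is one reason multi-step planning is harder than predicting a single view transition.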

Novelty

The work appears novel in both problem formulation and method. It introduces ViewSuite as a benchmark for multi-turn view planning in real 3D scenes with full 6-DoF control, and proposes view graph distillation, which turns even failed exploration trajectories into reusable supervision by organizing them as a graph of connected viewpoints.

Results

Across 13 frontier VLMs, the authors find that strong single-turn performance on Path-to-View and View-to-Path does not translate into strong Interactive View Planning, where the best frontier model (Gemini 3.1 Pro) reaches only 21.4%. Their iterative self-exploration and view graph distillation framework improves Qwen2.5-VL-7B-Instruct from 2.5% to 47.8% on Interactive View Planning, exceeding the reported results for GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). The trained model also shows evidence of transferable spatial priors: when both receive identical post-training, it outperforms its base model on related ViewSuite tasks and on the external MindCube benchmark.

Key Points

  1. ViewSuite benchmarks view planning in real ScanNet-based 3D scenes through three complementary tasks that separate single-step view-transition understanding from multi-turn planning.
  2. Frontier VLMs show a planning gap: they achieve much stronger results on single-turn view reasoning than on Interactive View Planning, with performance worsening as viewpoint distance grows.
  3. The proposed iterative framework uses self-exploration plus view graph distillation to extract supervision from all trajectories, including failures, leading to large gains on Interactive View Planning and improved transfer to related spatial reasoning tasks.
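
One way to picture how a graph of connected viewpoints can turn failed trajectories into supervision is the sketch below. It is an illustrative assumption, not the authors' implementation: the trajectory format (`(view, action, next_view)` triples), the use of breadth-first search to relabel view pairs with shortest action sequences, and the function names `build_view_graph`, `shortest_actions`, and `distill` are all hypothetical.

```python
from collections import deque

def build_view_graph(trajectories):
    """Merge all explored transitions, successful or not, into one directed graph."""
    graph = {}
    for trajectory in trajectories:
        for src, action, dst in trajectory:
            graph.setdefault(src, {})[action] = dst
            graph.setdefault(dst, {})
    return graph

def shortest_actions(graph, start, goal):
    """Breadth-first search for a shortest action sequence from start to goal."""
    if start == goal:
        return []
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for action, nxt in graph[node].items():
            if nxt in parent:
                continue
            parent[nxt] = (node, action)
            if nxt == goal:
                # Walk back to the start to recover the action sequence.
                path = []
                cursor = nxt
                while parent[cursor] is not None:
                    cursor, taken = parent[cursor]
                    path.append(taken)
                return path[::-1]
            queue.append(nxt)
    return None  # goal was never reached from start during exploration

def distill(trajectories):
    """Emit one (start, target, actions) example per reachable pair of views."""
    graph = build_view_graph(trajectories)
    examples = []
    for start in graph:
        for target in graph:
            if start == target:
                continue
            actions = shortest_actions(graph, start, target)
            if actions is not None:
                examples.append(
                    {"start": start, "target": target, "actions": actions}
                )
    return examples
```

In this sketch, an episode that never reached its intended target still contributes edges, and two separate episodes that share a viewpoint can be stitched into a path that neither episode found on its own.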
