Related papers: Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

URL: http://arxiv.org/abs/2601.11109v2
Date: Thu, 22 Jan 2026 01:46:22 GMT
Title: Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Authors: Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, Haiwen Feng
Abstract summary: VIGA (Vision-as-Inverse-Graphic Agent) reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory.
Score: 105.35082963701541
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs cannot achieve this in one shot, as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphic Agent), which starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn't require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing. Empirically, we found VIGA substantially improves over one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn't require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, where VIGA improves by 124.70%.
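The write-run-render-compare-revise procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ContextMemory`, `closed_loop`, `toy_generate`, `toy_render`, and `toy_verify` are hypothetical, the "scene program" is reduced to a list of integers, and the generator and verifier are toy functions standing in for VLM calls and a graphics engine. Only the loop structure and the three memory fields (plans, code diffs, render history) are taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ContextMemory:
    """Hypothetical evolving memory: plans, code diffs, and render history."""
    plans: List[Optional[int]] = field(default_factory=list)
    code_diffs: List[str] = field(default_factory=list)
    renders: List[Tuple[int, ...]] = field(default_factory=list)


def toy_generate(target, program, memory):
    """Generator role (toy): revise the program using the last verifier feedback."""
    values = list(program) or [0] * len(target)
    if memory.plans and memory.plans[-1] is not None:
        idx = memory.plans[-1]          # index flagged by the verifier
        values[idx] = target[idx]       # fix exactly one element per step
    return values


def toy_render(program):
    """Stand-in for running the program in a graphics engine."""
    return tuple(program)


def toy_verify(target, image):
    """Verifier role (toy): return (error count, index of first mismatch or None)."""
    mismatches = [i for i, (a, b) in enumerate(zip(image, target)) if a != b]
    return len(mismatches), (mismatches[0] if mismatches else None)


def closed_loop(target, generate, render, verify, max_steps=10):
    memory = ContextMemory()
    program = []                                        # start from an empty world
    for _ in range(max_steps):
        new_program = generate(target, program, memory)  # write
        image = render(new_program)                      # run + render
        error, feedback = verify(target, image)          # compare
        memory.plans.append(feedback)                    # plan for the next revision
        memory.code_diffs.append(f"{program} -> {new_program}")
        memory.renders.append(image)
        program = new_program                            # revise
        if error == 0:
            break
    return program, memory


final_program, memory = closed_loop((3, 1, 4), toy_generate, toy_render, toy_verify)
print(final_program, len(memory.renders))
```

In this toy setting the generator and verifier are separate functions that share one memory object, which mirrors the alternating roles only loosely; in a real system both roles would presumably be prompts to the same VLM, and the comparison would be between rendered and target images rather than tuples.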

Related papers

MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation [59.75554954111619]
We introduce Multi-view 3D Referring Expression (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives.
arXiv Detail & Related papers (2026-01-11T11:44:07Z)
VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement [66.13644883379087]
We tackle three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, we introduce an MCP-based API. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework.
arXiv Detail & Related papers (2025-12-26T19:22:39Z)
Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis [38.10984626023432]
We introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts.
arXiv Detail & Related papers (2025-12-12T14:03:16Z)
View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs [19.27758108925572]
3D visual grounding identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing. We propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning.
arXiv Detail & Related papers (2025-12-10T00:59:17Z)
GraphPad: Inference-Time 3D Scene Graph Updates for Embodied Question Answering [63.17411943434755]
GraphPad is a modifiable structured memory that an agent can tailor to the needs of the task through API calls. It comprises a mutable scene graph representing the environment, a navigation log indexing frame-by-frame content, and a scratchpad for task-specific notes.
arXiv Detail & Related papers (2025-06-01T21:13:38Z)
Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames [17.975173937253494]
An embodied AI assistant operating on egocentric video must integrate spatial cues across time. Disjoint-3DQA is a generative QA benchmark that evaluates this ability of VLMs.
arXiv Detail & Related papers (2025-05-30T06:32:26Z)
Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing [4.268804603388096]
We present BlenderGym, the first comprehensive vision-language models (VLMs) system benchmark for 3D graphics editing. We evaluate closed- and open-source VLM systems and observe that even the state-of-the-art VLM system struggles with tasks relatively easy for human Blender users.
arXiv Detail & Related papers (2025-04-02T14:51:45Z)
3D-LatentMapper: View Agnostic Single-View Reconstruction of 3D Shapes [0.0]
We propose a novel framework that leverages the intermediate latent spaces of Vision Transformer (ViT) and a joint image-text representational model, CLIP, for fast and efficient Single View Reconstruction (SVR). We use the ShapeNetV2 dataset and perform extensive experiments with comparisons to SOTA methods to demonstrate our method's effectiveness.
arXiv Detail & Related papers (2022-12-05T11:45:26Z)
Human Mesh Recovery from Multiple Shots [85.18244937708356]
We propose a framework for improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh. We show that the resulting data is beneficial in the training of various human mesh recovery models. The tools we develop open the door to processing and analyzing in 3D content from a large library of edited media.
arXiv Detail & Related papers (2020-12-17T18:58:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.