BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing
- URL: http://arxiv.org/abs/2504.01786v1
- Date: Wed, 02 Apr 2025 14:51:45 GMT
- Title: BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing
- Authors: Yunqi Gu, Ian Huang, Jihyeon Je, Guandao Yang, Leonidas Guibas
- Abstract summary: We present BlenderGym, the first comprehensive vision-language model (VLM) system benchmark for 3D graphics editing. We evaluate closed- and open-source VLM systems and observe that even the state-of-the-art VLM system struggles with tasks relatively easy for human Blender users.
- Score: 4.268804603388096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D graphics editing is crucial in applications like movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating this process is challenging because graphical editing requires performing a variety of tasks, each requiring distinct skill sets. Recently, vision-language models (VLMs) have emerged as a powerful framework for automating the editing process, but their development and evaluation are bottlenecked by the lack of a comprehensive benchmark that requires human-level perception and presents real-world editing complexity. In this work, we present BlenderGym, the first comprehensive VLM system benchmark for 3D graphics editing. BlenderGym evaluates VLM systems through code-based 3D reconstruction tasks. We evaluate closed- and open-source VLM systems and observe that even the state-of-the-art VLM system struggles with tasks relatively easy for human Blender users. Enabled by BlenderGym, we study how inference scaling techniques impact VLM's performance on graphics editing tasks. Notably, our findings reveal that the verifier used to guide the scaling of generation can itself be improved through inference scaling, complementing recent insights on inference scaling of LLM generation in coding and math tasks. We further show that inference compute is not uniformly effective and can be optimized by strategically distributing it between generation and verification.
Related papers
- LLM-Driven 3D Scene Generation of Agricultural Simulation Environments [1.002902747701998]
Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. This paper investigates the use of LLMs to generate agricultural synthetic simulation environments from natural language prompts. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine.
arXiv Detail & Related papers (2026-02-12T08:33:01Z) - How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z) - Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning [105.35082963701541]
VIGA (Vision-as-Inverse-Graphics Agent) reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory.
arXiv Detail & Related papers (2026-01-16T09:11:55Z) - VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement [66.13644883379087]
We tackle three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, we introduce an MCP-based API. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework.
arXiv Detail & Related papers (2025-12-26T19:22:39Z) - UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying [64.5307229755533]
We introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability. We implement our method based on the latest BLIP3-o and achieve state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.
arXiv Detail & Related papers (2025-08-05T06:42:09Z) - IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering [7.247417417159471]
Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition.
arXiv Detail & Related papers (2025-06-29T17:02:57Z) - FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [47.8417810406568]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs).
Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity.
Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle three fundamental capabilities together: understanding, generation, and editing.
ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics.
We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z) - What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? [30.685102474291046]
Dynamic Scene Graph Generation (DSGG) for videos is a challenging task in computer vision.
We take a closer look at their predicted scene graphs and discover three critical issues with existing DSGG methods.
We show that LMMs with simple decoder-only structure can be turned into State-of-the-Art scene graph generators.
arXiv Detail & Related papers (2025-03-20T04:58:53Z) - Leveraging Large Language Models For Scalable Vector Graphics Processing: A Review [0.0]
Traditional vectorization techniques suffer from long processing times and excessive output complexity. The advent of large language models (LLMs) has opened new possibilities for the generation, editing, and analysis of vector graphics.
arXiv Detail & Related papers (2025-03-06T21:23:17Z) - ConvMesh: Reimagining Mesh Quality Through Convex Optimization [55.2480439325792]
This research introduces a convex optimization approach, disciplined convex programming, to enhance existing meshes.
By focusing on a sparse set of point clouds from both the original and target meshes, this method demonstrates significant improvements in mesh quality with minimal data requirements.
arXiv Detail & Related papers (2024-12-11T15:48:25Z) - MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z) - LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z) - BlenderAlchemy: Editing 3D Graphics with Vision-Language Models [4.852796482609347]
A vision-based edit generator and state evaluator work together to find the correct sequence of actions to achieve the goal.
Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of Vision-Language Models with "imagined" reference images.
arXiv Detail & Related papers (2024-04-26T19:37:13Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - 3D-GPT: Procedural 3D Modeling with Large Language Models [47.72968643115063]
We introduce 3D-GPT, a framework utilizing large language models (LLMs) for instruction-driven 3D modeling.
3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task.
Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results, but also collaborates effectively with human designers.
arXiv Detail & Related papers (2023-10-19T17:41:48Z) - gradSim: Differentiable simulation for system identification and visuomotor control [66.37288629125996]
We present gradSim, a framework that overcomes the dependence on 3D supervision by leveraging differentiable multiphysics simulation and differentiable rendering.
Our unified graph enables learning in challenging visuomotor control tasks, without relying on state-based (3D) supervision.
arXiv Detail & Related papers (2021-04-06T16:32:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.