Coding the Visual World: From Image to Simulation Using Vision Language Models
- URL: http://arxiv.org/abs/2601.05344v1
- Date: Thu, 08 Jan 2026 19:49:05 GMT
- Title: Coding the Visual World: From Image to Simulation Using Vision Language Models
- Authors: Sagi Eppel
- Abstract summary: This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate systems in images. The VLM is given a natural image of a real-world system and is tasked with describing the system and writing code that simulates and generates it. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) demonstrate the capacity to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.
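To make the Im2Sim round trip concrete, the following is a minimal Python sketch of the loop the abstract describes: send an image to a VLM, extract the generative code it returns, run that code, and collect the synthetic image for comparison. The `query_vlm` stub, the prompt wording, and the code-fence extraction are illustrative assumptions, not the paper's implementation; any multimodal chat API (e.g., GPT or Gemini) could back the stub.

```python
import base64
import subprocess
import tempfile
from pathlib import Path


def query_vlm(image_b64: str, prompt: str) -> str:
    """Stub for a multimodal chat call (assumption: provider-specific)."""
    raise NotImplementedError("wire this to your VLM provider")


def im2sim(image_path: str, out_path: str = "synthetic.png") -> str:
    """One Im2Sim round trip: image -> description + generative code -> image."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()

    # 1. Ask the VLM to describe the depicted system and to emit a
    #    standalone script that procedurally regenerates it.
    prompt = (
        "Describe the system shown in this image, then write a standalone "
        f"Python script that simulates it and saves a rendering to {out_path}."
    )
    reply = query_vlm(image_b64, prompt)

    # 2. Pull the generated script out of the reply (assumes the VLM wraps
    #    its code in a markdown fence).
    fence = "`" * 3
    code = reply.split(fence + "python")[1].split(fence)[0]

    # 3. Execute the generated simulation code as a scratch script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    subprocess.run(["python", f.name], check=True)

    # 4. out_path now holds the synthetic image, ready to be compared
    #    (visually or with an image-similarity metric) against the original.
    return out_path
```

Executing model-generated code should, of course, happen in a sandbox; the direct `subprocess` call above is only for brevity.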
Related papers
- Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling [68.14113731953971]
This paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like imagination. We show that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks.
arXiv Detail & Related papers (2025-12-01T16:01:41Z)
- Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models [9.24989979549793]
Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. These models typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. We introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency.
arXiv Detail & Related papers (2025-09-23T16:07:18Z)
- Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [11.222744122842023]
We introduce a plug-and-play module that implicitly incorporates 3D geometry features into Vision-Language-Action (VLA) models. Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z)
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [49.04935506942202]
Lumina-mGPT is a family of multimodal autoregressive models capable of various vision and language tasks. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only Autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models.
arXiv Detail & Related papers (2024-08-05T17:46:53Z)
- Re-Thinking Inverse Graphics With Large Language Models [51.333105116400205]
Inverse graphics -- inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics.
We propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM.
We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training.
arXiv Detail & Related papers (2024-04-23T16:59:02Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- A Vision Check-up for Language Models [61.852026871772914]
We show how a preliminary visual representation learning system can be trained using models of text.
Experiments on self-supervised visual representation learning highlight the potential to train vision models capable of making semantic assessments of natural images.
arXiv Detail & Related papers (2024-01-03T18:09:33Z)