PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
- URL: http://arxiv.org/abs/2509.25185v1
- Date: Mon, 29 Sep 2025 17:59:49 GMT
- Title: PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
- Authors: Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Yujiu Yang, Rui Wang,
- Abstract summary: PixelCraft is a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images.<n>Building on this foundation, PixelCraft facilitates visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism.
- Score: 58.73779101355669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained with low-fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.
Related papers
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning.<n>We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment.<n>We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z) - Thinking with Programming Vision: Towards a Unified View for Thinking with Images [23.596757163808906]
We show that even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions.<n>We propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation.
arXiv Detail & Related papers (2025-12-03T12:44:15Z) - RECODE: Reasoning Through Code Generation for Visual Question Answering [68.86938437188964]
We propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning.<n>Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
arXiv Detail & Related papers (2025-10-15T17:05:37Z) - DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning [16.880791276029964]
"Thinking with images" is a shift in Vision Language Models from text-dominant chain-of-thought to image-interactive reasoning.<n>We present DeepSketcher, a comprehensive suite comprising both an image-text interleaved dataset and a self-contained model.<n>We design a model that performs interleaved image-text reasoning and generates "visual thoughts" by operating directly in the visual embedding space.
arXiv Detail & Related papers (2025-09-30T07:02:01Z) - UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning [83.68366772745689]
We propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses.<n>Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference.<n>The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos.
arXiv Detail & Related papers (2025-09-22T17:59:40Z) - Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback [33.127607245587576]
We introduce a framework that enables MLLMs to learn complex visual reasoning from only raw images.<n>We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning.<n>The RRVF-trained model not only outperforms existing MLLMs and supervised fine-tuning baselines but also exhibits superior generalization.
arXiv Detail & Related papers (2025-07-28T12:21:19Z) - PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions [65.73013453373349]
PixWizard is an image-to-image visual assistant designed for image generation, manipulation, and translation based on free-from language instructions.<n>We tackle a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning dataset.<n>Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions.
arXiv Detail & Related papers (2024-09-23T17:59:46Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense
Reasoning [61.57887011165744]
multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.