VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
- URL: http://arxiv.org/abs/2511.02778v1
- Date: Tue, 04 Nov 2025 18:00:18 GMT
- Authors: Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
- Abstract summary: We advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench).
- Score: 51.95090758710288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
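The CodeVQA protocol described in the abstract (image to SVG, SVG to rendering, then question answering over the rendering) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the `coder`, `render`, and `policy` callables, the well-formedness check, and the exact-match scoring are all assumptions introduced here, and the toy stand-ins at the end exist only so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable
import xml.etree.ElementTree as ET


@dataclass
class Sample:
    """One benchmark item (hypothetical layout, not the VCode data format)."""
    image: str     # stand-in for the source image, e.g. a file path
    question: str
    answer: str    # reference answer


def is_well_formed_svg(svg: str) -> bool:
    """Return True if the text parses as XML with an <svg> root element."""
    try:
        root = ET.fromstring(svg)
    except ET.ParseError:
        return False
    # Drop an XML namespace prefix such as "{http://www.w3.org/2000/svg}".
    return root.tag.rsplit("}", 1)[-1] == "svg"


def codevqa_accuracy(
    samples: Iterable[Sample],
    coder: Callable[[str], str],         # image -> SVG code (assumed interface)
    render: Callable[[str], Any],        # SVG code -> rendered image (assumed)
    policy: Callable[[Any, str], str],   # (rendering, question) -> answer (assumed)
) -> float:
    """Fraction of questions answered correctly from the rendered SVG alone."""
    correct = 0
    total = 0
    for sample in samples:
        total += 1
        svg = coder(sample.image)
        if not is_well_formed_svg(svg):
            continue  # assumption: an unparseable SVG counts as a wrong answer
        predicted = policy(render(svg), sample.question)
        # Assumption: case-insensitive exact match stands in for real scoring.
        if predicted.strip().lower() == sample.answer.strip().lower():
            correct += 1
    return correct / total if total else 0.0


# Toy stand-ins so the sketch runs end to end without any model.
def toy_coder(image: str) -> str:
    return ('<svg xmlns="http://www.w3.org/2000/svg">'
            '<circle cx="5" cy="5" r="3" fill="red"/></svg>')


def toy_render(svg: str) -> str:
    return svg  # no rasterization; the "rendering" is just the SVG text


def toy_policy(rendered: str, question: str) -> str:
    return "red" if 'fill="red"' in rendered else "unknown"


demo_samples = [
    Sample("img1.png", "What color is the circle?", "red"),
    Sample("img2.png", "How many squares are there?", "2"),
]
print(codevqa_accuracy(demo_samples, toy_coder, toy_render, toy_policy))  # 0.5
```

In a real setup the three callables would wrap a VLM that writes SVG, an SVG rasterizer, and the policy model. Keeping them as plain function arguments means the scoring loop does not depend on any particular model or rendering library.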
Related papers
- Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure [57.89872230703339]
We introduce a framework that recovers the semantic structure required for reliable SVG animation. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence.
arXiv Detail & Related papers (2025-12-16T12:03:46Z)
- DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance [48.98604326855894]
We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality.
arXiv Detail & Related papers (2025-12-11T18:23:03Z)
- SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation [47.390332111383294]
We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs.
arXiv Detail & Related papers (2025-09-29T05:25:00Z)
- UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models [9.310212949500011]
We propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. UniSVG is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V.
arXiv Detail & Related papers (2025-08-11T08:50:14Z)
- SVGen: Interpretable Vector Graphics Generation with Large Language Models [61.62816031675714]
We introduce SVG-1M, a large-scale dataset of high-quality SVGs paired with natural language descriptions. We create well-aligned Text to SVG training pairs, including a subset with Chain of Thought annotations for enhanced semantic guidance. Based on this dataset, we propose SVGen, an end-to-end model that generates SVG code from natural language inputs.
arXiv Detail & Related papers (2025-08-06T15:00:24Z)
- Rendering-Aware Reinforcement Learning for Vector Graphics Generation [15.547843461605746]
We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in vision-language models (VLMs). Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs.
arXiv Detail & Related papers (2025-05-27T06:56:00Z)
- Visually Descriptive Language Model for Vector Graphics Reasoning [76.42082386029206]
We propose the Visually Descriptive Language Model (VDLM) to bridge the gap between low-level visual perception and high-level language reasoning. We show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks.
arXiv Detail & Related papers (2024-04-09T17:30:18Z)