InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2510.11341v2
- Date: Tue, 04 Nov 2025 06:25:23 GMT
- Title: InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
- Authors: Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang
- Abstract summary: We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks. We propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens.
- Score: 65.49118879021016
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
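The abstract mentions SVG-specific special tokens with subword-based embedding initialization but does not spell out the procedure. A minimal sketch of one common reading, assuming a new token's embedding is set to the mean of the embeddings of the subwords its text splits into. The vocabulary, the embedding values, the greedy splitter, and the `<path>` token are all hypothetical and are not taken from the paper:

```python
from statistics import fmean

# Toy vocabulary and embedding table (hypothetical values, 3 dimensions).
vocab = {"<": 0, "path": 1, "circle": 2, ">": 3}
embeddings = [
    [0.0, 1.0, 2.0],  # "<"
    [3.0, 0.0, 3.0],  # "path"
    [6.0, 6.0, 0.0],  # "circle"
    [0.0, 2.0, 1.0],  # ">"
]


def split_into_subwords(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
    return pieces


def add_special_token(token, vocab, embeddings):
    """Append `token` with an embedding equal to the mean of its subword embeddings."""
    pieces = split_into_subwords(token, vocab)
    rows = [embeddings[vocab[p]] for p in pieces]
    # Average each embedding dimension across the subword rows.
    new_row = [fmean(column) for column in zip(*rows)]
    vocab[token] = len(embeddings)
    embeddings.append(new_row)
    return new_row


# "<path>" splits into "<", "path", ">" and receives their average.
print(add_special_token("<path>", vocab, embeddings))  # prints [1.0, 1.0, 2.0]
```

Starting a new token from the average of related existing embeddings, rather than from random values, places it near semantically similar tokens before any training happens. Whether InternSVG uses a plain mean or some other combination is not stated in this listing.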
Related papers
- Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure [57.89872230703339]
We introduce a framework that recovers the semantic structure required for reliable SVG animation. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence.
arXiv Detail & Related papers (2025-12-16T12:03:46Z)
- DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance [48.98604326855894]
We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality.
arXiv Detail & Related papers (2025-12-11T18:23:03Z)
- RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance [32.59099674596894]
RoboSVG is a unified framework for generating interactive SVGs guided by textual, visual, and numerical signals. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples. RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation.
arXiv Detail & Related papers (2025-10-26T13:57:08Z)
- SVGThinker: Instruction-Aligned and Reasoning-Driven Text-to-SVG Generation [47.390332111383294]
We present SVGThinker, a reasoning-driven framework that aligns the production of SVG code with the visualization process. Our pipeline first renders each primitive in sequence and uses a multimodal model to annotate the image and code. Experiments against state-of-the-art baselines show that SVGThinker produces more stable, editable, and higher-quality SVGs.
arXiv Detail & Related papers (2025-09-29T05:25:00Z)
- UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models [9.310212949500011]
We propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. UniSVG is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V.
arXiv Detail & Related papers (2025-08-11T08:50:14Z)
- SVGen: Interpretable Vector Graphics Generation with Large Language Models [61.62816031675714]
We introduce SVG-1M, a large-scale dataset of high-quality SVGs paired with natural language descriptions. We create well-aligned Text to SVG training pairs, including a subset with Chain of Thought annotations for enhanced semantic guidance. Based on this dataset, we propose SVGen, an end-to-end model that generates SVG code from natural language inputs.
arXiv Detail & Related papers (2025-08-06T15:00:24Z)
- OmniSVG: A Unified Scalable Vector Graphics Generation Model [69.59073636922287]
We propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the synthesis of complex SVG structure. We introduce MMSVG-2M, a multimodal dataset with two million annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks.
arXiv Detail & Related papers (2025-04-08T17:59:49Z)
- DeepSVG: A Hierarchical Generative Network for Vector Graphics Animation [217.86315551526235]
We propose a novel hierarchical generative network, called DeepSVG, for complex SVG icons generation and manipulation. Our architecture effectively disentangles high-level shapes from the low-level commands that encode the shape itself. We demonstrate that our network learns to accurately reconstruct diverse vector graphics, and can serve as a powerful animation tool.
arXiv Detail & Related papers (2020-07-22T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.