UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2508.07766v1
- Date: Mon, 11 Aug 2025 08:50:14 GMT
- Title: UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
- Authors: Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao,
- Abstract summary: We propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. UniSVG is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V.
- Score: 9.310212949500011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and are frequently employed in computer vision and artistic design, represented as SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, the rapid growth of Multi-modal Large Language Models (MLLMs) has demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release dataset, benchmark, weights, codes and experiment details on https://ryanlijinke.github.io/.
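To make concrete what the abstract means by "a set of curves and lines controlled by floating-point parameters", here is a minimal, illustrative SVG string (the shapes and values are invented for this example, not drawn from the dataset) parsed with Python's standard library:

```python
# Illustrative only: a minimal SVG whose shapes are defined entirely by
# floating-point attributes, which is what a model generating SVG code
# must predict with high precision.
import xml.etree.ElementTree as ET

svg_code = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">'
    '<circle cx="32.0" cy="32.0" r="14.5" fill="#ff0000"/>'
    '<path d="M 8.0 56.0 C 20.5 40.0, 43.5 40.0, 56.0 56.0" '
    'stroke="#000" fill="none"/>'
    '</svg>'
)

root = ET.fromstring(svg_code)
ns = "{http://www.w3.org/2000/svg}"

# Every geometric attribute is a continuous float; a small numeric error
# visibly changes the rendered shape.
circle = root.find(f"{ns}circle")
params = {k: float(v) for k, v in circle.attrib.items() if k != "fill"}
print(params)  # {'cx': 32.0, 'cy': 32.0, 'r': 14.5}
```

Understanding tasks (e.g. "what color is the circle?") read these attributes back out; generation tasks must emit the entire string, parameters included, from a text or image condition.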
Related papers
- DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance [48.98604326855894]
We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality.
arXiv Detail & Related papers (2025-12-11T18:23:03Z)
- RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance [32.59099674596894]
RoboSVG is a unified framework for generating interactive SVGs guided by textual, visual, and numerical signals. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples. RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation.
arXiv Detail & Related papers (2025-10-26T13:57:08Z)
- InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models [65.49118879021016]
We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks. We propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens.
arXiv Detail & Related papers (2025-10-13T12:38:04Z)
- SVGen: Interpretable Vector Graphics Generation with Large Language Models [61.62816031675714]
We introduce SVG-1M, a large-scale dataset of high-quality SVGs paired with natural language descriptions. We create well-aligned Text-to-SVG training pairs, including a subset with Chain of Thought annotations for enhanced semantic guidance. Based on this dataset, we propose SVGen, an end-to-end model that generates SVG code from natural language inputs.
arXiv Detail & Related papers (2025-08-06T15:00:24Z)
- OmniSVG: A Unified Scalable Vector Graphics Generation Model [69.59073636922287]
We propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the synthesis of complex SVG structure. We introduce MMSVG-2M, a multimodal dataset with two million annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks.
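The coordinate-tokenization idea mentioned in the OmniSVG summary can be sketched with a simple uniform quantizer. The bin count, canvas size, and round-trip scheme below are assumptions for illustration, not the paper's actual tokenization:

```python
# Hypothetical sketch: quantizing continuous SVG coordinates into discrete
# token ids so a language model can predict them. The bin count (256) and
# canvas extent (64.0) are assumed values, not OmniSVG's actual settings.

N_BINS = 256    # size of the coordinate vocabulary
CANVAS = 64.0   # assumed viewBox extent in user units

def coord_to_token(x: float) -> int:
    """Map a coordinate in [0, CANVAS] to a discrete bin id in [0, N_BINS)."""
    x = min(max(x, 0.0), CANVAS)          # clamp out-of-range values
    return min(int(x / CANVAS * N_BINS), N_BINS - 1)

def token_to_coord(t: int) -> float:
    """Invert the mapping to the bin center (lossy by up to half a bin)."""
    return (t + 0.5) / N_BINS * CANVAS

coords = [8.0, 20.5, 43.5, 56.0]
tokens = [coord_to_token(c) for c in coords]
print(tokens)      # [32, 82, 174, 224]

recovered = [token_to_coord(t) for t in tokens]
print(recovered)   # [8.125, 20.625, 43.625, 56.125]
```

The round-trip error (here at most half a bin, 0.125 units) is the price of discretization: finer bins reduce geometric error but enlarge the vocabulary the model must learn.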
arXiv Detail & Related papers (2025-04-08T17:59:49Z)
- Visually Descriptive Language Model for Vector Graphics Reasoning [76.42082386029206]
We propose the Visually Descriptive Language Model (VDLM) to bridge the gap between low-level visual perception and high-level language reasoning. We show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks.
arXiv Detail & Related papers (2024-04-09T17:30:18Z)
- StarVector: Generating Scalable Vector Graphics Code from Images and Text [15.32194071443065]
We introduce StarVector, a multimodal large language model for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. It is trained on SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks.
arXiv Detail & Related papers (2023-12-17T08:07:32Z)
- SVG-Net: An SVG-based Trajectory Prediction Model [67.68864911674308]
Anticipating motions of vehicles in a scene is an essential problem for safe autonomous driving systems.
To this end, the comprehension of the scene's infrastructure is often the main clue for predicting future trajectories.
Most of the proposed approaches represent the scene with a rasterized format, and some of the more recent approaches leverage custom vectorized formats.
arXiv Detail & Related papers (2021-10-07T18:00:08Z)
- DeepSVG: A Hierarchical Generative Network for Vector Graphics Animation [217.86315551526235]
We propose a novel hierarchical generative network, called DeepSVG, for complex SVG icons generation and manipulation.
Our architecture effectively disentangles high-level shapes from the low-level commands that encode the shape itself.
We demonstrate that our network learns to accurately reconstruct diverse vector graphics, and can serve as a powerful animation tool.
arXiv Detail & Related papers (2020-07-22T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.