Related papers: Symbolic Graphics Programming with Large Language Models

URL: http://arxiv.org/abs/2509.05208v1
Date: Fri, 05 Sep 2025 16:10:53 GMT
Title: Symbolic Graphics Programming with Large Language Models
Authors: Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu
Abstract summary: Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG.
Score: 36.27405949272913
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
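
As a rough illustration of the reward structure described in the abstract, the sketch below applies a format-validity gate and then combines a text-image score with an optional image-image score. It is not the paper's implementation. The well-formedness check is a cheap stand-in for an actual render test, and `text_image_score`, `image_image_score`, `w_text`, and `w_image` are hypothetical placeholders for a renderer plus encoders such as SigLIP and DINO.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

SVG_NAMESPACE = "{http://www.w3.org/2000/svg}"


def is_wellformed_svg(program: str) -> bool:
    """Stand-in validity gate: the text parses as XML and the root is <svg>."""
    try:
        root = ET.fromstring(program)
    except ET.ParseError:
        return False
    return root.tag in ("svg", SVG_NAMESPACE + "svg")


def gated_reward(
    program: str,
    caption: str,
    text_image_score: Callable[[str, str], float],
    image_image_score: Optional[Callable[[str], float]] = None,
    w_text: float = 1.0,
    w_image: float = 1.0,
) -> float:
    """Return 0.0 for invalid programs, otherwise a weighted sum of scores.

    text_image_score(caption, program) and image_image_score(program) are
    hypothetical callables; in practice they would render the SVG and
    compare embeddings from vision encoders.
    """
    if not is_wellformed_svg(program):
        return 0.0  # the gate: programs that fail the check earn no reward
    reward = w_text * text_image_score(caption, program)
    if image_image_score is not None:
        reward += w_image * image_image_score(program)
    return reward


if __name__ == "__main__":
    svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="5" fill="red"/></svg>'
    dummy_score = lambda caption, program: 0.8  # placeholder, not a real encoder
    print(gated_reward(svg, "a red circle", dummy_score))
    print(gated_reward("<svg><circle", "a red circle", dummy_score))
```

Passing the scorers in as callables keeps the gate logic independent of any particular renderer or encoder, so the same function can be exercised with dummy scores as in the usage lines above.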

Related papers

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
Leveraging Large Language Models For Scalable Vector Graphics Processing: A Review [0.0]
Traditional vectorization techniques suffer from long processing times and excessive output complexity. The advent of large language models (LLMs) has opened new possibilities for the generation, editing, and analysis of vector graphics.
arXiv Detail & Related papers (2025-03-06T21:23:17Z)
NeuralSVG: An Implicit Representation for Text-to-Vector Generation [54.4153300455889]
We propose NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique. We demonstrate that NeuralSVG outperforms existing methods in generating structured and flexible SVG.
arXiv Detail & Related papers (2025-01-07T18:50:06Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
Multi-View Empowered Structural Graph Wordification for Language Models [12.22063024099311]
We introduce an end-to-end modality-aligning framework for LLM-graph alignment: Dual-Residual Vector Quantized-Variational AutoEncoder, namely Dr.E. Our approach is purposefully designed to facilitate token-level alignment with LLMs, enabling an effective translation of the intrinsic 'language' of graphs into comprehensible natural language. Our framework ensures a degree of visual interpretability, efficiency, and robustness, marking a promising step toward token-level alignment between LLMs and GNNs.
arXiv Detail & Related papers (2024-06-19T16:43:56Z)
Re-Thinking Inverse Graphics With Large Language Models [51.333105116400205]
Inverse graphics -- inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics. We propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM. We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training.
arXiv Detail & Related papers (2024-04-23T16:59:02Z)
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis [112.25071764647683]
StrokeNUWA is a pioneering work exploring a better visual representation, "stroke tokens", for vector graphics. Equipped with stroke tokens, StrokeNUWA can significantly surpass traditional LLM-based and optimization-based methods. StrokeNUWA achieves up to a 94x inference speedup over prior methods with an exceptional SVG code compression ratio of 6.9%.
arXiv Detail & Related papers (2024-01-30T15:20:26Z)
Which Modality should I use -- Text, Motif, or Image? : Understanding Graphs with Large Language Models [14.251972223585765]
This paper introduces a new approach to encoding a graph with diverse modalities, such as text, image, and motif, and prompts to approximate a graph's global connectivity. The study also presents GraphTMI, a novel benchmark for evaluating Large Language Models (LLMs) in graph structure analysis.
arXiv Detail & Related papers (2023-11-16T12:45:41Z)
Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding [46.042197741423365]
Large language models (LLMs) have made significant advancements in natural language understanding. This work investigates if it is possible for the LLM to understand images as well.
arXiv Detail & Related papers (2023-06-09T17:57:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.