TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
- URL: http://arxiv.org/abs/2603.03072v1
- Date: Tue, 03 Mar 2026 15:17:56 GMT
- Title: TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning
- Authors: Christian Greisinger, Steffen Eger
- Abstract summary: Existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ. We construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3.
- Score: 21.738227405440785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.
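The abstract describes the RL stage only at a high level: an image encoder trained via inverse graphics scores how faithfully the rendered figure matches the target. The sketch below is a minimal illustration of that general idea, not the paper's implementation; `render_tikz` and `image_encoder` are hypothetical stand-ins, and cosine similarity between embeddings is assumed as the scalar reward.

```python
# Illustrative sketch only: render_tikz and image_encoder are hypothetical
# stand-ins, not TikZilla's actual components.
import torch
import torch.nn.functional as F

def tikz_reward(tikz_program: str,
                reference_image: torch.Tensor,
                image_encoder: torch.nn.Module,
                render_tikz) -> float:
    """Score a sampled TikZ program by how closely its rendered figure
    matches the reference figure in the encoder's embedding space."""
    # Compile and rasterize the program to a (C, H, W) tensor; None on LaTeX error.
    rendered = render_tikz(tikz_program)
    if rendered is None:
        return 0.0  # non-compiling programs earn no reward
    with torch.no_grad():
        z_gen = image_encoder(rendered.unsqueeze(0))          # (1, d) embedding of the generation
        z_ref = image_encoder(reference_image.unsqueeze(0))   # (1, d) embedding of the reference
    # Cosine similarity lies in [-1, 1]; clamp to [0, 1] so a figure that
    # compiles is never scored below a failed compilation.
    sim = F.cosine_similarity(z_gen, z_ref, dim=-1).item()
    return max(sim, 0.0)
```

In a PPO- or GRPO-style loop, a reward of this form would be computed once per sampled program; zeroing out non-compiling outputs is one plausible way to push the policy toward valid LaTeX, though the paper does not specify its exact reward shaping.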
Related papers
- Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - Scaling Down Text Encoders of Text-to-Image Diffusion Models [24.751226627178475]
Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. We employ vision-based knowledge distillation to train a series of T5 encoder models. Our results demonstrate a scaling-down pattern: the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL.
arXiv Detail & Related papers (2025-03-25T17:55:20Z) - VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new dataset, VEGA, tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z) - DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ [32.12690388609568]
DeTikZify is a novel language model that automatically synthesizes scientific figures as semantics-preserving TikZ graphics programs.
We create three new datasets: DaTikZv2, SketchFig, and MetaFig.
We train DeTikZify on MetaFig and DaTikZv2, along with synthetically generated sketches learned from SketchFig.
arXiv Detail & Related papers (2024-05-24T07:48:35Z) - RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [77.59074909960913]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA). We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. Learnable tokens serve as a communication medium among these four frozen models (GPT-2, XCLIP, CLIP, and AnglE).
arXiv Detail & Related papers (2024-05-11T16:22:00Z) - TextSquare: Scaling up Text-Centric Visual Instruction Tuning [62.878378882175284]
We introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M. Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs. It even outperforms top-tier models like GPT4V and Gemini on 6 of 10 text-centric benchmarks.
arXiv Detail & Related papers (2024-04-19T11:38:08Z) - Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z) - Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts [65.84370471189676]
We look at large-scale intermediate pre-training of decomposition-based transformers using distant supervision from comparable texts.
We show that with such intermediate pre-training, developing robust decomposition-based models for a diverse range of tasks becomes more feasible.
arXiv Detail & Related papers (2022-10-30T15:38:03Z) - LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B are English-language pairs.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.