TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
- URL: http://arxiv.org/abs/2503.11509v2
- Date: Wed, 19 Mar 2025 12:42:41 GMT
- Title: TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
- Authors: Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto,
- Abstract summary: We present TikZero, which synthesizes graphics program generation from text understanding by using image representations as an intermediary bridge.<n>It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis.<n>We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs.
- Score: 56.35987342339608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of generative AI, synthesizing figures from text captions becomes a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.
Related papers
- Fine-Grained Video Captioning through Scene Graph Consolidation [44.30028794237688]
We propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation.<n>Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions.
arXiv Detail & Related papers (2025-02-23T03:59:05Z) - RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA)<n>We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.<n>To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
Captioning [13.357749288588039]
Previous works leverage the CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z) - AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with
TikZ [38.2820447703639]
We introduce DaTikZ, the first large-scale TikZ dataset consisting of 120k TikZ drawings aligned with captions.
We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings.
In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures.
arXiv Detail & Related papers (2023-09-30T13:15:49Z) - Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z) - Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z) - Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of large-scale training set and model architecture is the key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.