LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
- URL: http://arxiv.org/abs/2503.21749v1
- Date: Thu, 27 Mar 2025 17:56:15 GMT
- Title: LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
- Authors: Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Peng Gao, Bin Fu, Zhen Li
- Abstract summary: We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1. We develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina.
- Score: 56.00885545573299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
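The abstract names Pairwise Normalized Edit Distance (PNED) as the text-accuracy metric, but its exact formulation is not reproduced on this page. The sketch below is a minimal, assumed reading: words from the rendered text and the prompt are greedily paired, each pair is scored by a length-normalized Levenshtein distance, and unmatched words on either side are penalized with 1.0. The function names (`edit_distance`, `pned`), the greedy matching rule, and the penalty value are illustrative assumptions, not the paper's definition.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via one-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,           # delete ca
                        dp[j - 1] + 1,       # insert cb
                        prev + (ca != cb))   # substitute
            prev = cur
    return dp[-1]


def pned(predicted: List[str], reference: List[str]) -> float:
    """Average normalized edit distance over greedily matched word pairs;
    unmatched words on either side contribute a penalty of 1.0 (assumption)."""
    preds = list(predicted)
    total, count = 0.0, 0
    for ref in reference:
        if not preds:
            total, count = total + 1.0, count + 1   # word never rendered
            continue
        best = min(preds, key=lambda p: edit_distance(p, ref))
        total += edit_distance(best, ref) / max(len(best), len(ref), 1)
        count += 1
        preds.remove(best)
    total += len(preds)                              # spurious extra words
    count += len(preds)
    return total / count if count else 0.0


# Lower is better: 0.0 means every requested word was rendered exactly.
print(pned(["Helo", "Wrold"], ["Hello", "World"]))  # 0.3
```

An exact implementation would follow the pairing rule defined in the LeX-Art paper; this sketch only conveys the shape of a normalized, pairwise edit-distance score.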
Related papers
- RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
A substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning. We establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales.
arXiv Detail & Related papers (2025-02-18T03:58:38Z)
- T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos [9.742383920787413]
T2VEval is a multi-branch fusion scheme for text-to-video quality evaluation. It assesses videos across three branches: text-video consistency, realness, and technical quality. T2VEval achieves state-of-the-art performance across multiple metrics.
arXiv Detail & Related papers (2025-01-15T03:11:33Z)
- SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models [39.21242589835842]
We introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs.
To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs.
In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets.
arXiv Detail & Related papers (2024-07-30T11:57:40Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA).
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
We propose using learnable tokens as a communication medium among these four frozen models: GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models. Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback [31.230023678131843]
FigCaps-HF is a new framework for figure-caption generation that incorporates domain expert feedback in generating captions optimized for reader preferences.
Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences.
arXiv Detail & Related papers (2023-07-20T13:40:22Z)
- RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis [104.53930611219654]
We present a large-scale synthetic dataset for novel view synthesis consisting of 300k images rendered from nearly 2000 complex scenes.
The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis.
Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures.
arXiv Detail & Related papers (2022-05-14T13:15:32Z)
- NeX: Real-time View Synthesis with Neural Basis Expansion [1.471992435706872]
We present NeX, a new approach to novel view synthesis based on enhancements of the multiplane image (MPI) representation.
Our technique models view-dependent effects by parameterizing each pixel as a linear combination of basis functions learned from a neural network, as sketched below.
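A toy illustration of that parameterization, assuming N learned global basis functions of the viewing direction and per-pixel RGB coefficients stored alongside the MPI; the random data, the sizes, and the stand-in `learned_basis` function are placeholders, not the NeX architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-pixel data: a view-independent base color k0 and N coefficient
# vectors k_n (one RGB triplet per basis function per pixel).
H, W, N = 4, 4, 8                       # image height/width, number of bases
k0 = rng.random((H, W, 3))              # base color
k = rng.random((H, W, N, 3))            # per-pixel basis coefficients

def learned_basis(view_dir: np.ndarray) -> np.ndarray:
    """Stand-in for the small network that maps a viewing direction to the
    N global basis values shared by all pixels (placeholder, not NeX's net)."""
    phases = np.arange(1, N + 1)
    return np.cos(phases * view_dir.sum())         # shape (N,)

view = np.array([0.1, 0.2, 0.97])
h_n = learned_basis(view)

# View-dependent color per pixel: C(v) = k0 + sum_n k_n * H_n(v)
color = k0 + np.einsum("hwnc,n->hwc", k, h_n)
print(color.shape)                                  # (4, 4, 3)
```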
arXiv Detail & Related papers (2021-03-09T18:27:27Z)
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout, and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.