Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning
- URL: http://arxiv.org/abs/2602.09439v1
- Date: Tue, 10 Feb 2026 06:06:54 GMT
- Title: Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning
- Authors: Xu Ma, Yitian Zhang, Qihua Dong, Yun Fu
- Abstract summary: Fine-T2I is a large-scale, high-quality, and fully open dataset for text-to-image fine-tuning. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality. The final dataset contains over 6 million text-image pairs, around 2 TB on disk.
- Score: 39.35923155873977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.
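As a rough sanity check on the scale figures quoted in the abstract, the following sketch works through the arithmetic. It assumes the "over 95% removed" figure applies to the whole candidate pool that produced the final 6 million pairs, that "2 TB" means 2 x 10^12 bytes, and that the round numbers are exact; the function and constant names are illustrative only and do not come from the paper.

```python
# Back-of-envelope arithmetic from the abstract's figures.
# Assumptions (not stated in the paper): decimal terabytes, exactly
# 6 million final pairs, and exactly 95% of candidates removed.

FINAL_PAIRS = 6_000_000      # "over 6 million text-image pairs"
REMOVED_FRACTION = 0.95      # "over 95% of initial candidates removed"
DISK_BYTES = 2 * 10**12      # "around 2 TB on disk"


def min_initial_candidates(final_pairs: int, removed_fraction: float) -> float:
    """Lower bound on the candidate pool, given the retained fraction."""
    kept_fraction = 1.0 - removed_fraction
    return final_pairs / kept_fraction


def avg_kilobytes_per_pair(total_bytes: int, pairs: int) -> float:
    """Average on-disk size of one text-image pair, in kilobytes."""
    return total_bytes / pairs / 1000


candidates = min_initial_candidates(FINAL_PAIRS, REMOVED_FRACTION)
avg_kb = avg_kilobytes_per_pair(DISK_BYTES, FINAL_PAIRS)

print(f"Initial candidates: at least ~{candidates / 1e6:.0f} million")
print(f"Average size per pair: ~{avg_kb:.0f} KB")
```

Under these assumptions the candidate pool would have been on the order of 120 million samples or more, and each retained pair averages roughly a third of a megabyte on disk.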
Related papers
- Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification [31.116511358786084]
Text-to-image (T2I) models are increasingly used for synthetic dataset generation. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification.
arXiv Detail & Related papers (2025-10-28T05:40:14Z) - Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs [36.42060582800515]
We introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models. TPO works by training the model to prefer matched prompts over mismatched prompts. Our framework is general and compatible with existing preference-based algorithms.
arXiv Detail & Related papers (2025-09-30T04:32:34Z) - AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models [58.85362281293525]
We introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation.
arXiv Detail & Related papers (2025-09-19T16:41:39Z) - Scalable Ranked Preference Optimization for Text-to-Image Generation [76.16285931871948]
We investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training.
The preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process.
We introduce RankDPO to enhance DPO-based methods using the ranking feedback.
arXiv Detail & Related papers (2024-10-23T16:42:56Z) - VersaT2I: Improving Text-to-Image Models with Versatile Reward [32.30564849001593]
VersaT2I is a versatile training framework that can boost the performance of any text-to-image (T2I) model.
We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc.
arXiv Detail & Related papers (2024-03-27T12:08:41Z) - Improving Text-to-Image Consistency via Automatic Prompt Optimization [26.2587505265501]
We introduce a T2I optimization-by-prompting framework, OPT2I, to improve prompt-image consistency in T2I models.
Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score.
arXiv Detail & Related papers (2024-03-26T15:42:01Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers which aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z) - Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models [67.68871360210208]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. We propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between fine-tuning and pretrained models. We show that our approach achieves better prompt fidelity and subject fidelity than those post-optimized for merging regular fine-tuned models.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - Paragraph-to-Image Generation with Information-Enriched Diffusion Model [62.81033771780328]
ParaDiffusion is an information-enriched diffusion model for paragraph-to-image generation task. It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.