Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
- URL: http://arxiv.org/abs/2512.16905v1
- Date: Thu, 18 Dec 2025 18:57:58 GMT
- Title: Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
- Authors: Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao,
- Abstract summary: We propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training.
- Score: 46.7396881133767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches to Text-to-Image data filtering rely on costly manual curation or heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, there is no adaptation for image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
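The abstract's rating stage can be illustrated with a minimal first-order sketch: score each training sample by how well its gradient aligns with the average gradient on a small held-out set, then keep the top-scoring fraction. This is a toy under stated assumptions (a linear model with squared error; hypothetical names such as `score_samples` and `select_subset`), not the paper's actual rater or its Shift-Gsampling procedure.

```python
import numpy as np

def per_sample_grads(W, X, y):
    """Per-sample gradients of 0.5 * (x @ W - y)^2 for a linear model."""
    residual = X @ W - y                # shape (n,)
    return residual[:, None] * X        # shape (n, d)

def score_samples(W, X_train, y_train, X_val, y_val):
    """Influence proxy: alignment of each training-sample gradient with the
    mean held-out gradient. A positive score means a gradient step on that
    sample would (to first order) also reduce the held-out loss."""
    g_train = per_sample_grads(W, X_train, y_train)          # (n, d)
    g_val = per_sample_grads(W, X_val, y_val).mean(axis=0)   # (d,)
    return g_train @ g_val                                    # (n,)

def select_subset(scores, keep_fraction=0.5):
    """Indices of the top `keep_fraction` highest-scoring samples."""
    k = max(1, int(len(scores) * keep_fraction))
    return np.argsort(scores)[::-1][:k]
```

In this sketch a mislabeled sample whose gradient opposes the held-out gradient receives a negative score and is pruned first, which mirrors the paper's motivation for filtering low-quality or redundant pairs.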
Related papers
- Dataset Distillation via Relative Distribution Matching and Cognitive Heritage [22.61595713543967]
We introduce statistical flow matching, a stable and efficient supervised learning framework. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data.
arXiv Detail & Related papers (2026-02-05T07:18:48Z) - Dataset Distillation for Pre-Trained Self-Supervised Vision Models [43.50190223507616]
Dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. We introduce a method of dataset distillation for this task called Linear Gradient Matching. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models.
arXiv Detail & Related papers (2025-11-20T18:59:57Z) - Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents [62.616106562146776]
We propose a Visual-Centric Selection approach via Agents Collaboration (ViSA). Our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images.
arXiv Detail & Related papers (2025-02-27T09:37:30Z) - Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion [16.356794123589246]
Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. Diffusion Curriculum (DisCL) adjusts the image guidance level of image synthesis for each training stage. DisCL first focuses on high-quality, lower-guidance images to learn features as a warm-up before learning from higher-guidance images that may be weaker in diversity or quality.
arXiv Detail & Related papers (2024-10-17T15:33:35Z) - HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts [49.21764163995419]
We introduce HYPerbolic Entailment filtering (HYPE) to extract meaningful and well-aligned data from noisy image-text pair datasets.
HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark.
This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models.
arXiv Detail & Related papers (2024-04-26T16:19:55Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data [31.507451966555383]
We present a novel algorithm that incorporates human knowledge on image-text alignment to guide filtering vast corpus of web-crawled image-text datasets.
We collect a diverse image-text dataset where each image is associated with multiple captions from various sources.
We train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment.
arXiv Detail & Related papers (2023-12-11T05:57:09Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from a paired data and to progressively associate unpaired data.
We present extensive and comprehensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a thorough analysis on the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training example.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the cost of training on the enlarged dataset, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.