Autoguided Online Data Curation for Diffusion Model Training
- URL: http://arxiv.org/abs/2509.15267v1
- Date: Thu, 18 Sep 2025 10:09:04 GMT
- Title: Autoguided Online Data Curation for Diffusion Model Training
- Authors: Valeria Pais, Luis Oala, Daniele Faccio, Marco Aversa
- Abstract summary: We investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Across experiments, autoguidance consistently improves sample quality and diversity.
- Score: 3.610779934162847
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The costs of generative model compute rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.
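The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names and the NumPy setting are hypothetical, and the selection step is deliberately simplified. Autoguidance is shown as the usual extrapolation from a weaker guide model's output towards the main model's output with weight w. Selection is shown as independent top-k ranking by a learnability score (learner loss minus reference loss), whereas JEST proper samples examples jointly as a batch. Which model supplies the reference loss is left open here.

```python
import numpy as np


def autoguided_denoise(d_main, d_guide, w):
    """Extrapolate from the guide model's output towards the main model's.

    w = 1 returns the main model's output unchanged; w > 1 pushes further
    away from the weaker guide model.
    """
    d_main = np.asarray(d_main, dtype=float)
    d_guide = np.asarray(d_guide, dtype=float)
    return d_guide + w * (d_main - d_guide)


def select_by_learnability(learner_loss, reference_loss, keep):
    """Return the indices of the `keep` examples with the highest score.

    Score = learner loss - reference loss: high when the learner still does
    badly on an example that the reference model handles well.
    Simplification: examples are ranked independently, not sampled jointly.
    """
    learner_loss = np.asarray(learner_loss, dtype=float)
    reference_loss = np.asarray(reference_loss, dtype=float)
    scores = learner_loss - reference_loss
    order = np.argsort(-scores, kind="stable")  # descending by score
    return np.sort(order[:keep])


if __name__ == "__main__":
    # Toy per-example losses for a super-batch of four examples.
    kept = select_by_learnability([0.9, 0.2, 0.8, 0.5], [0.1, 0.1, 0.7, 0.1], keep=2)
    print(kept)
    print(autoguided_denoise([1.0, 2.0], [0.0, 0.0], w=2.0))
```

In a training loop following the "early" schedule described in the abstract, one would call the selection step only for the first part of training and fall back to uniform random batches afterwards; the switch point is a design choice this sketch does not fix.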
Related papers
- Towards Understanding Valuable Preference Data for Large Language Model Alignment [85.38864561060088]
Large language model (LLM) alignment is typically achieved through learning from human preference comparisons. We assess data quality through individual influence on validation data using our newly proposed truncated influence function (TIF). To this end, we combine them to offset their diverse error sources, resulting in a simple yet effective data selection rule.
arXiv Detail & Related papers (2025-10-15T06:57:55Z)
- Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning [35.359482937263145]
We propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality.
arXiv Detail & Related papers (2025-07-17T11:13:44Z)
- OASIS: Online Sample Selection for Continual Visual Instruction Tuning [55.92362550389058]
In continual instruction tuning (CIT) scenarios, new instruction tuning data continuously arrive in an online streaming manner. Data selection can mitigate this overhead, but existing strategies often rely on pretrained reference models. Recent reference model-free online sample selection methods address this, but typically select a fixed number of samples per batch.
arXiv Detail & Related papers (2025-05-27T20:32:43Z)
- Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection [6.471199527741301]
We introduce a new framework called VDS-TTT - Verifier-Driven Sample Selection for Test-Time Training. We use a learned verifier to score a pool of generated responses and select only from high-ranking pseudo-labeled examples for fine-tuned adaptation. We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence.
arXiv Detail & Related papers (2025-05-26T03:54:47Z)
- Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [50.492124556982674]
This paper introduces a novel choice-based sample selection framework. It shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples. We validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
arXiv Detail & Related papers (2025-03-04T07:32:41Z)
- Data curation via joint example selection further accelerates multimodal learning [3.329535792151987]
We show that jointly selecting batches of data is more effective for learning than selecting examples independently.
We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points.
arXiv Detail & Related papers (2024-06-25T16:52:37Z)
- Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process.
Vanilla reference-model-free methods involve independently scoring and selecting data in a sample-wise manner.
We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
arXiv Detail & Related papers (2024-06-07T12:12:20Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.