Training Flow Matching Models with Reliable Labels via Self-Purification
- URL: http://arxiv.org/abs/2509.19091v1
- Date: Tue, 23 Sep 2025 14:43:27 GMT
- Title: Training Flow Matching Models with Reliable Labels via Self-Purification
- Authors: Hyeongju Kim, Yechan Yu, June Young Yi, Juheon Lee
- Abstract summary: We introduce Self-Purifying Flow Matching (SPFM), a principled approach to filtering unreliable data within the flow-matching framework. SPFM identifies suspicious data using the model itself during the training process, bypassing the need for pretrained models or additional modules. Our experiments demonstrate that models trained with SPFM generate samples that accurately adhere to the specified conditioning, even when trained on noisy labels.
- Score: 6.131772929312606
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training datasets are inherently imperfect, often containing mislabeled samples due to human annotation errors, limitations of tagging models, and other sources of noise. Such label contamination can significantly degrade the performance of a trained model. In this work, we introduce Self-Purifying Flow Matching (SPFM), a principled approach to filtering unreliable data within the flow-matching framework. SPFM identifies suspicious data using the model itself during the training process, bypassing the need for pretrained models or additional modules. Our experiments demonstrate that models trained with SPFM generate samples that accurately adhere to the specified conditioning, even when trained on noisy labels. Furthermore, we validate the robustness of SPFM on the TITW dataset, which consists of in-the-wild speech data, achieving performance that surpasses existing baselines.
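The abstract describes the core mechanism only at a high level: the model's own per-sample training signal is used to flag suspicious data, with no pretrained filter. A minimal sketch of that idea, assuming a toy linear velocity model and a simple loss-quantile thresholding rule (both illustrative assumptions, not the authors' exact algorithm):

```python
# Hypothetical sketch of self-purification during flow matching training.
# Per-sample flow matching losses are computed, and samples above a loss
# quantile are treated as "suspicious" and excluded. The model, data, and
# thresholding rule here are illustrative, not the method from the paper.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_losses(theta, x0, x1, t):
    # Linear interpolation path x_t = (1 - t) * x0 + t * x1,
    # with target velocity x1 - x0 (standard conditional flow matching).
    xt = (1 - t) * x0 + t * x1
    v_pred = theta * xt          # toy linear velocity model
    v_target = x1 - x0
    return (v_pred - v_target) ** 2

def purify_mask(losses, keep_frac=0.8):
    # Keep the keep_frac fraction of samples with the smallest loss.
    thresh = np.quantile(losses, keep_frac)
    return losses <= thresh

# Toy batch: x0 is noise, x1 is data, and the first 10 targets are corrupted,
# mimicking mislabeled samples.
x0 = rng.standard_normal(100)
x1 = np.ones(100)
x1[:10] = 50.0                   # 10 corrupted targets
t = rng.uniform(size=100)

losses = flow_matching_losses(0.5, x0, x1, t)
mask = purify_mask(losses, keep_frac=0.8)
# The corrupted samples incur much larger losses and fall outside the mask,
# so gradient updates would be computed only on the retained samples.
print(int(mask.sum()), int(mask[:10].sum()))
```

In a real training loop the mask (or a soft per-sample weight) would gate the loss before backpropagation each step, so the filtering adapts as the model improves.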
Related papers
- Combating Noisy Labels through Fostering Self- and Neighbor-Consistency [120.4394402099635]
Label noise is pervasive in various real-world scenarios, posing challenges in supervised deep learning. We propose a noise-robust method named Jo-SNC (Joint sample selection and model regularization based on Self- and Neighbor-Consistency). We design a self-adaptive, data-driven thresholding scheme to adjust per-class selection thresholds.
arXiv Detail & Related papers (2026-01-19T07:55:29Z) - TSFM in-context learning for time-series classification of bearing-health status [2.1569807291469454]
This paper introduces a classification method using in-context learning in time-series foundation models (TSFMs). We show how data that was not part of the TSFM training corpus can be classified without fine-tuning the model. We apply this method to vibration data for assessing the health state of a bearing within a servo-press motor.
arXiv Detail & Related papers (2025-11-19T14:01:12Z) - Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation [2.699704259580951]
Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators.
arXiv Detail & Related papers (2025-11-03T11:45:26Z) - DDB: Diffusion Driven Balancing to Address Spurious Correlations [24.940576844328408]
Deep neural networks trained with Empirical Risk Minimization often fail to generalize to out-of-distribution samples. We propose a Diffusion Driven Balancing (DDB) technique to generate training samples with text-to-image diffusion models. Our experiments show that our technique achieves better worst-group accuracy than the existing state-of-the-art methods.
arXiv Detail & Related papers (2025-03-21T15:28:22Z) - Early Stopping Against Label Noise Without Validation Data [54.27621957395026]
We propose a novel early stopping method called Label Wave, which does not require validation data for selecting the desired model. We show both the effectiveness of the Label Wave method across various settings and its capability to enhance the performance of existing methods for learning with noisy labels.
arXiv Detail & Related papers (2025-02-11T13:40:15Z) - Importance of Disjoint Sampling in Conventional and Transformer Models for Hyperspectral Image Classification [2.1223532600703385]
This paper presents an innovative disjoint sampling approach for training SOTA models on Hyperspectral image classification (HSIC) tasks.
By separating training, validation, and test data without overlap, the proposed method facilitates a fairer evaluation of how well a model can classify pixels it was not exposed to during training or validation.
This rigorous methodology is critical for advancing SOTA models and their real-world application to large-scale land mapping with Hyperspectral sensors.
arXiv Detail & Related papers (2024-04-23T11:40:52Z) - Label-Noise Robust Diffusion Models [18.82847557713331]
Conditional diffusion models have shown remarkable performance in various generative tasks.
Training them requires large-scale datasets that often contain noise in conditional inputs, a.k.a. noisy labels.
This paper proposes Transition-aware weighted Denoising Score Matching for training conditional diffusion models with noisy labels.
arXiv Detail & Related papers (2024-02-27T14:00:34Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent [97.64313409741614]
We propose to enforce a consistency property, which states that predictions of the model on its own generated data are consistent across time.
We show that our novel training objective yields state-of-the-art results for conditional and unconditional generation on CIFAR-10, and improvements over baselines on AFHQ and FFHQ.
arXiv Detail & Related papers (2023-02-17T18:45:04Z) - MAPS: A Noise-Robust Progressive Learning Approach for Source-Free Domain Adaptive Keypoint Detection [76.97324120775475]
Cross-domain keypoint detection methods always require accessing the source data during adaptation.
This paper considers source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain.
arXiv Detail & Related papers (2023-02-09T12:06:08Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Self Training with Ensemble of Teacher Models [8.257085583227695]
In order to train robust deep learning models, large amounts of labelled data are required.
In the absence of such large repositories of labelled data, unlabeled data can be exploited instead.
Semi-Supervised learning aims to utilize such unlabeled data for training classification models.
arXiv Detail & Related papers (2021-07-17T09:44:09Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.