Related papers: ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models

URL: http://arxiv.org/abs/2509.08972v2
Date: Fri, 12 Sep 2025 01:02:19 GMT
Title: ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models
Authors: Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhoseini, Farinaz Koushanfar
Abstract summary: We identify models' overconfidence in their self-generated data as a key driver of collapse. We introduce a novel loss function we call Truncated Cross Entropy (TCE). These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models.
Score: 13.096745830570944
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The increasing reliance on generative AI models has accelerated the generation rate of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030. This shift to mainly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. Although prior studies have explored the causes and detection of model collapse, existing mitigation strategies remain limited. In this paper, we identify models' overconfidence in their self-generated data as a key driver of collapse. Building on this observation, we propose a confidence-aware loss function that downweights high-confidence predictions during training. We introduce a novel loss function we call Truncated Cross Entropy (TCE). We demonstrate that TCE significantly delays model collapse in recursive training. We provide a model-agnostic framework that links the loss function design to model collapse mitigation and validate our approach both theoretically and empirically, showing that it can extend the model's fidelity interval before collapse by more than 2.3x. Finally, we show that our method generalizes across modalities. These findings suggest that the design of loss functions provides a simple yet powerful tool for preserving the quality of generative models in the era of increasing synthetic data.
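
To make the idea of a confidence-aware loss concrete, here is a minimal sketch of one way a truncated cross entropy could work. The abstract does not give the formula, so everything below is an assumption for illustration: the hard cutoff, the threshold name `gamma`, its default value, and the function names are hypothetical and are not taken from the paper. In this reading, a token whose target probability is already at or above `gamma` contributes zero loss, so the model receives no further gradient from predictions it is already confident about.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def truncated_cross_entropy(batch_logits, targets, gamma=0.9):
    """Cross entropy that ignores high-confidence predictions.

    ASSUMPTION: the hard threshold `gamma` is an illustrative choice,
    not the definition used in the paper.

    batch_logits: list of per-token logit lists
    targets:      list of target class indices, one per token
    gamma:        confidence threshold in (0, 1]
    """
    losses = []
    for logits, target in zip(batch_logits, targets):
        p = softmax(logits)[target]
        # Tokens predicted with probability >= gamma contribute nothing;
        # all other tokens contribute the usual -log(p).
        losses.append(0.0 if p >= gamma else -math.log(p))
    # Average over all tokens, including the truncated ones.
    return sum(losses) / len(losses)


if __name__ == "__main__":
    logits = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    targets = [0, 0]
    print(truncated_cross_entropy(logits, targets, gamma=0.9))
    print(truncated_cross_entropy(logits, targets, gamma=1.0))
```

With `gamma=1.0` no token is truncated in this example, so the function behaves like ordinary mean cross entropy; lowering `gamma` removes more of the confident tokens from the objective. A soft weighting that shrinks smoothly with confidence would be another plausible variant.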

Related papers

On the Dangers of Bootstrapping Generation for Continual Learning and Beyond [8.530455607001828]
We present a statistical analysis showing that synthetic data introduces significant bias and variance into training objectives. We quantify this degradation and show that state-of-the-art GER methods fail to maintain alignment in the latent space.
arXiv Detail & Related papers (2025-12-05T15:16:30Z)
Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence [31.751930228965467]
We investigate ways to modify this synthetic retraining process to avoid model collapse. Our key finding is that by injecting information through an external synthetic data verifier, synthetic retraining will not cause model collapse.
arXiv Detail & Related papers (2025-10-18T22:39:39Z)
Deep Generative Continual Learning using Functional LoRA: FunLoRA [12.547444644243543]
A common strategy consists in retraining the generative model on its own synthetic data in order to mitigate forgetting. We propose a novel and more expressive conditioning mechanism for generative models based on low rank adaptation (LoRA). Our proposed parameter-efficient fine-tuning (PEFT) method surpasses prior state-of-the-art results based on diffusion models.
arXiv Detail & Related papers (2025-10-03T00:18:05Z)
Shifting AI Efficiency From Model-Centric to Data-Centric Compression [67.45087283924732]
We argue that the focus of research for AI is shifting from model-centric compression to data-centric compression. Data-centric compression improves AI efficiency by directly compressing the volume of data processed during model training or inference. Our work aims to provide a novel perspective on AI efficiency, synthesize existing efforts, and catalyze innovation to address the challenges posed by ever-increasing context lengths.
arXiv Detail & Related papers (2025-05-25T13:51:17Z)
A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
Learning by Surprise: Surplexity for Mitigating Model Collapse in Generative AI [1.6545633988217645]
As synthetic content infiltrates the web, generative AI models may be retrained on their own outputs. This leads to model collapse: a progressive loss of performance and diversity across generations. We introduce new measures that characterise collapse directly from a model's next-token probability distributions.
arXiv Detail & Related papers (2024-10-16T08:02:48Z)
Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse [9.59833542807268]
Model collapse occurs when new models are trained on synthetic data generated from previously trained models. We show that model collapse cannot be avoided when training solely on synthetic data. We estimate a maximal amount of synthetic data below which model collapse can eventually be avoided.
arXiv Detail & Related papers (2024-04-07T22:15:13Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
Model Collapse Demystified: The Case of Regression [12.115359951879462]
We study the phenomenon of "model collapse" in the era of proliferation of large language and image generation models. We obtain analytic formulae which quantitatively outline this phenomenon in a broad range of regimes. We propose a simple strategy based on adaptive regularization to mitigate model collapse.
arXiv Detail & Related papers (2024-02-12T15:26:01Z)
EsaCL: Efficient Continual Learning of Sparse Models [10.227171407348326]
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks. We propose a new method for efficient continual learning of sparse models (EsaCL) that can automatically prune redundant parameters without adversely impacting the model's predictive power.
arXiv Detail & Related papers (2024-01-11T04:59:44Z)
Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation [65.268245109828]
We propose an end-to-end framework to generate text descriptions of failure modes associated with spurious correlations. These descriptions can be used to generate synthetic data using generative models, such as diffusion models. Our experiments have shown remarkable improvements in accuracy (~21%) on hard sub-populations.
arXiv Detail & Related papers (2023-12-09T04:43:49Z)
On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
Closed-form Continuous-Depth Models [99.40335716948101]
Continuous-depth neural models rely on advanced numerical differential equation solvers. We present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster.
arXiv Detail & Related papers (2021-06-25T22:08:51Z)
Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective. Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.