Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis
- URL: http://arxiv.org/abs/2512.11912v1
- Date: Thu, 11 Dec 2025 02:10:41 GMT
- Title: Robustness of Probabilistic Models to Low-Quality Data: A Multi-Perspective Analysis
- Authors: Liu Peng, Yaochu Jin
- Abstract summary: A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient. Under the same levels of data corruption, class-conditional diffusion models degrade catastrophically.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A systematic, comparative investigation into the effects of low-quality data reveals a stark spectrum of robustness across modern probabilistic models. We find that autoregressive language models, from token prediction to sequence-to-sequence tasks, are remarkably resilient (for GPT-2, test NLL increases modestly from 2.87 to 3.59 despite 50% token corruption). By contrast, under the same levels of data corruption, class-conditional diffusion models degrade catastrophically (image-label consistency plummets by 56.81% relative to baseline), while classifiers show a moderate impact that diminishes with dataset scale. To explain these discrepancies, we analyze the results through a multi-perspective lens, integrating information theory, PAC learning, and gradient dynamics. These analyses suggest that robustness is heavily influenced by two key principles: the richness of conditioning information, which constrains the learning problem, and the absolute information content of the training data, which allows the signal from correct information to dominate statistical noise.
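The corruption protocol described in the abstract can be illustrated with a minimal sketch. This is an assumption about the setup, not the authors' code: each token is independently replaced with a uniformly random vocabulary token with probability p (here p = 0.5, matching the 50% corruption level reported for GPT-2). The function name `corrupt_tokens` and the toy sequence are hypothetical.

```python
import random

def corrupt_tokens(tokens, vocab_size, p, rng):
    """Replace each token with a uniformly random vocabulary token
    with probability p, leaving the remaining tokens untouched."""
    return [rng.randrange(vocab_size) if rng.random() < p else t
            for t in tokens]

rng = random.Random(0)
clean = [rng.randrange(50257) for _ in range(10_000)]  # GPT-2-sized vocabulary
noisy = corrupt_tokens(clean, vocab_size=50257, p=0.5, rng=rng)

# Roughly half the positions should now differ from the original sequence
# (a resampled token coincides with the original with probability 1/50257,
# which is negligible here).
frac_changed = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print(f"corrupted fraction: {frac_changed:.2f}")
```

Training on `noisy` while evaluating test NLL on clean held-out text is one way to realize the comparison reported in the abstract; the paper's exact corruption scheme may differ.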
Related papers
- Differential Privacy in Two-Layer Networks: How DP-SGD Harms Fairness and Robustness [2.9327666088683664]
This paper introduces a unified feature-centric framework to analyze the feature learning dynamics of differentially private gradient descent (DP-SGD). We demonstrate that the noise required for privacy leads to suboptimal feature learning in these networks.
arXiv Detail & Related papers (2026-03-05T07:19:31Z)
- Noisy Analysis of Quantum SMOTE on Condition Monitoring and Fault Classification in Industrial and Energy Systems [0.5505634045241289]
Class imbalance is a fundamental issue in industrial condition monitoring and fault classification pipelines. This work presents a detailed benchmarking and investigation of classical classifiers under class imbalance mitigation. The results show that QSMOTE consistently corrects distributional skew and significantly enhances the performance of non-linear classifiers.
arXiv Detail & Related papers (2026-01-16T16:44:38Z)
- From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning [27.3606707777401]
We provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). Our analysis identifies two regimes -- data-scarce and data-abundant -- based on the signal-to-noise characteristics of the dataset.
arXiv Detail & Related papers (2025-10-28T07:53:24Z)
- Robust Molecular Property Prediction via Densifying Scarce Labeled Data [53.24886143129006]
In drug discovery, compounds most critical for advancing research often lie beyond the training set. We propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.
arXiv Detail & Related papers (2025-06-13T15:27:40Z)
- A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
- Rethinking Benign Overfitting in Two-Layer Neural Networks [2.486161976966064]
We refine the feature-noise data model by incorporating class-dependent heterogeneous noise and re-examine the overfitting phenomenon in neural networks. Our findings reveal that neural networks can leverage "data noise" to learn implicit features that improve the classification accuracy for long-tailed data.
arXiv Detail & Related papers (2025-02-17T15:20:04Z)
- The Risk of Federated Learning to Skew Fine-Tuning Features and Underperform Out-of-Distribution Robustness [50.52507648690234]
Federated learning has the risk of skewing fine-tuning features and compromising the robustness of the model.
We introduce three robustness indicators and conduct experiments across diverse robust datasets.
Our approach markedly enhances the robustness across diverse scenarios, encompassing various parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2024-01-25T09:18:51Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Analyzing the Effects of Handling Data Imbalance on Learned Features from Medical Images by Looking Into the Models [50.537859423741644]
Training a model on an imbalanced dataset can introduce unique challenges to the learning problem.
We look deeper into the internal units of neural networks to observe how handling data imbalance affects the learned features.
arXiv Detail & Related papers (2022-04-04T09:38:38Z)
- DeepAdversaries: Examining the Robustness of Deep Learning Models for Galaxy Morphology Classification [47.38422424155742]
In morphological classification of galaxies, we study the effects of perturbations in imaging data.
We show that training with domain adaptation improves model robustness and mitigates the effects of these perturbations.
arXiv Detail & Related papers (2021-12-28T21:29:02Z) - The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning [25.85044477227461]
Models that are more accurate on out-of-distribution data than an in-distribution-accuracy baseline would predict exhibit "effective robustness".
We find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence.
We discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.
arXiv Detail & Related papers (2021-06-30T06:21:42Z)
- An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z) - On the Role of Dataset Quality and Heterogeneity in Model Confidence [27.657631193015252]
Safety-critical applications require machine learning models that output accurate and calibrated probabilities.
Uncalibrated deep networks are known to make over-confident predictions.
We study the impact of dataset quality by examining how dataset size and label noise affect model confidence.
arXiv Detail & Related papers (2020-02-23T05:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.