Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension
- URL: http://arxiv.org/abs/2502.16523v1
- Date: Sun, 23 Feb 2025 10:04:21 GMT
- Title: Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension
- Authors: Yulong Wu, Viktor Schlegel, Riza Batista-Navarro
- Abstract summary: We show that natural perturbations result in performance degradation in pre-trained encoder language models. More worryingly, state-of-the-art Flan-T5 models and Large Language Models (LLMs) inherit these errors. In an effort to mitigate these errors, we show that it is possible to improve robustness to natural perturbations by training on naturally or synthetically perturbed examples.
- Score: 9.059990548158718
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As neural language models achieve human-comparable performance on Machine Reading Comprehension (MRC) and see widespread adoption, ensuring their robustness in real-world scenarios has become increasingly important. Current robustness evaluation research, though, primarily develops synthetic perturbation methods, leaving it unclear how well they reflect real-life scenarios. Considering this, we present a framework to automatically examine MRC models on naturally occurring textual perturbations, by replacing paragraphs in MRC benchmarks with their counterparts drawn from the available Wikipedia edit history. This perturbation type is natural because it does not stem from an artificial generative process, making it inherently distinct from the previously investigated synthetic approaches. In a large-scale study encompassing SQuAD datasets and various model architectures, we observe that natural perturbations result in performance degradation in pre-trained encoder language models. More worryingly, state-of-the-art Flan-T5 models and Large Language Models (LLMs) inherit these errors. Further experiments demonstrate that our findings generalise to natural perturbations found in other, more challenging MRC benchmarks. In an effort to mitigate these errors, we show that it is possible to improve robustness to natural perturbations by training on naturally or synthetically perturbed examples, though a noticeable gap remains compared to performance on unperturbed data.
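The core idea described in the abstract — swapping a benchmark paragraph for an earlier version of the same Wikipedia passage and keeping only examples that remain answerable — can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: it assumes SQuAD-style examples (a dict with `context`, `question`, and `answers` fields) and uses the public MediaWiki revisions API; the answer-preservation filter is a plausible heuristic rather than the paper's exact procedure.

```python
import requests

# Public MediaWiki API endpoint for English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "natural-perturbation-demo/0.1 (research sketch)"}


def fetch_latest_revisions(title: str, n: int = 2) -> list[str]:
    """Return the wikitext of the n most recent revisions of a Wikipedia page,
    newest first. An older revision can serve as the naturally perturbed
    counterpart of the current paragraph text."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "revisions",
        "titles": title,
        "rvprop": "content",
        "rvslots": "main",
        "rvlimit": n,
    }
    data = requests.get(API_URL, params=params, headers=HEADERS, timeout=30).json()
    revisions = data["query"]["pages"][0].get("revisions", [])
    return [rev["slots"]["main"]["content"] for rev in revisions]


def perturb_example(example: dict, revised_context: str) -> dict | None:
    """Swap a SQuAD-style example's context for its edit-history counterpart.

    The example is kept only if the gold answer string still occurs in the
    revised paragraph, so the question remains answerable (a plausible
    filtering heuristic, not necessarily the authors' exact procedure)."""
    answer = example["answers"]["text"][0]
    if answer not in revised_context:
        return None  # answer no longer present; discard this example
    return {**example, "context": revised_context}
```

In practice the older revision's wikitext would still need markup stripping (for example with a wikitext parser such as mwparserfromhell) and alignment to the specific paragraph used in the benchmark before it could replace the original context.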
Related papers
- Towards Robust LLMs: an Adversarial Robustness Measurement Framework [0.0]
Large Language Models (LLMs) remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications.
We adapt the Robustness Measurement and Assessment framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters.
Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.
arXiv Detail & Related papers (2025-04-24T16:36:19Z) - Impact of Noise on LLM-Models Performance in Abstraction and Reasoning Corpus (ARC) Tasks with Model Temperature Considerations [4.39614901077936]
Large Language Models (LLMs) have attracted growing interest for their structured reasoning capabilities.
The Abstraction and Reasoning Corpus benchmark plays a crucial role in evaluating these capabilities by testing how well AI models generalize to novel problems.
This work underscores the need for developing more robust and adaptable AI systems capable of handling the ambiguity and variability inherent in real-world scenarios.
arXiv Detail & Related papers (2025-04-22T13:43:58Z) - Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions [49.546479320670464]
This paper introduces specialized metrics for benchmarking the spatial robustness of segmentation models.
We propose region-aware multi-attack adversarial analysis, a method that enables a deeper understanding of model robustness.
The results reveal that models respond to these two types of threats differently.
arXiv Detail & Related papers (2025-04-02T11:37:39Z) - Contextualizing biological perturbation experiments through language [3.704686482174365]
PerturbQA is a benchmark for structured reasoning over perturbation experiments.
We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations.
As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR), a simple, domain-informed LLM framework.
arXiv Detail & Related papers (2025-02-28T18:15:31Z) - A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted.
Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs).
Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z) - Synthetic Feature Augmentation Improves Generalization Performance of Language Models [8.463273762997398]
Training and fine-tuning deep learning models on limited and imbalanced datasets poses substantial challenges.
We propose augmenting features in the embedding space by generating synthetic samples using a range of techniques.
We validate the effectiveness of this approach across multiple open-source text classification benchmarks.
arXiv Detail & Related papers (2025-01-11T04:31:18Z) - Towards Evaluating the Robustness of Visual State Space Models [63.14954591606638]
Vision State Space Models (VSSMs) have demonstrated remarkable performance in visual perception tasks.
However, their robustness under natural and adversarial perturbations remains a critical concern.
We present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios.
arXiv Detail & Related papers (2024-06-13T17:59:44Z) - Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic [51.967603572656266]
We introduce a consistent and theoretically grounded approach to annotating decompositional entailment.
We find that our new dataset, RDTE, has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets.
We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality.
arXiv Detail & Related papers (2024-02-22T18:55:17Z) - Understanding Robust Overfitting from the Feature Generalization Perspective [61.770805867606796]
Adversarial training (AT) constructs robust neural networks by incorporating adversarial perturbations into natural data.
It is plagued by the issue of robust overfitting (RO), which severely damages the model's robustness.
In this paper, we investigate RO from a novel feature generalization perspective.
arXiv Detail & Related papers (2023-10-01T07:57:03Z) - Quantifying the robustness of deep multispectral segmentation models against natural perturbations and data poisoning [0.0]
We characterize the performance and robustness of a multispectral (RGB and near infrared) image segmentation model subjected to adversarial attacks and natural perturbations.
We find both RGB and multispectral models are vulnerable to data poisoning attacks regardless of input or fusion architectures.
arXiv Detail & Related papers (2023-05-18T23:43:33Z) - Evaluating the Robustness of Neural Language Models to Input Perturbations [7.064032374579076]
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate noisy input texts.
We investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo to handle different types of input perturbations.
The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced.
arXiv Detail & Related papers (2021-08-27T12:31:17Z) - Generalized Real-World Super-Resolution through Adversarial Robustness [107.02188934602802]
We present Robust Super-Resolution, a method that leverages the generalization capability of adversarial attacks to tackle real-world SR.
Our novel framework poses a paradigm shift in the development of real-world SR methods.
By using a single robust model, we outperform state-of-the-art specialized methods on real-world benchmarks.
arXiv Detail & Related papers (2021-08-25T22:43:20Z) - Recoding latent sentence representations -- Dynamic gradient-based activation modification in RNNs [0.0]
In RNNs, encoding information in a suboptimal way can impact the quality of representations based on later elements in the sequence.
I propose an augmentation to standard RNNs in the form of a gradient-based correction mechanism.
I conduct different experiments in the context of language modeling, where the impact of using such a mechanism is examined in detail.
arXiv Detail & Related papers (2021-01-03T17:54:17Z) - Semantics Altering Modifications for Evaluating Comprehension in Machine Reading [1.1355639618103164]
We investigate whether machine reading comprehension models are able to correctly process Semantics Altering Modifications (SAM).
We present a method to automatically generate and align challenge sets featuring original and altered examples.
We apply the methodology in order to evaluate MRC models with regard to their capability to correctly process SAM-enriched data.
arXiv Detail & Related papers (2020-12-07T21:00:42Z) - Learning perturbation sets for robust machine learning [97.6757418136662]
We use a conditional generator that defines the perturbation set over a constrained region of the latent space.
We measure the quality of our learned perturbation sets both quantitatively and qualitatively.
We leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations.
arXiv Detail & Related papers (2020-07-16T16:39:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.