Robustness of LLMs to Perturbations in Text
- URL: http://arxiv.org/abs/2407.08989v1
- Date: Fri, 12 Jul 2024 04:50:17 GMT
- Title: Robustness of LLMs to Perturbations in Text
- Authors: Ayush Singh, Navpreet Singh, Shubham Vatsal,
- Abstract summary: Large language models (LLMs) have shown impressive performance, but can they handle the inevitable noise in real-world data?
This work tackles this critical question by investigating LLMs' resilience against morphological variations in text.
Our findings show that contrary to popular beliefs, generative LLMs are quiet robust to noisy perturbations in text.
- Score: 2.0670689746336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Having a clean dataset has been the foundational assumption of most natural language processing (NLP) systems. However, properly written text is rarely found in real-world scenarios and hence, oftentimes invalidates the aforementioned foundational assumption. Recently, Large language models (LLMs) have shown impressive performance, but can they handle the inevitable noise in real-world data? This work tackles this critical question by investigating LLMs' resilience against morphological variations in text. To that end, we artificially introduce varying levels of noise into a diverse set of datasets and systematically evaluate LLMs' robustness against the corrupt variations of the original text. Our findings show that contrary to popular beliefs, generative LLMs are quiet robust to noisy perturbations in text. This is a departure from pre-trained models like BERT or RoBERTa whose performance has been shown to be sensitive to deteriorating noisy text. Additionally, we test LLMs' resilience on multiple real-world benchmarks that closely mimic commonly found errors in the wild. With minimal prompting, LLMs achieve a new state-of-the-art on the benchmark tasks of Grammar Error Correction (GEC) and Lexical Semantic Change (LSC). To empower future research, we also release a dataset annotated by humans stating their preference for LLM vs. human-corrected outputs along with the code to reproduce our results.
Related papers
- Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context [12.781022584125925]
We construct a novel, controlled contrastive dataset designed to test whether LLMs can effectively use context to disambiguate idiomatic meaning.
Our findings reveal that LLMs often fail to resolve idiomaticity when it is required to attend to the surrounding context.
We make our code and dataset publicly available.
arXiv Detail & Related papers (2024-10-21T14:47:37Z) - RAC: Efficient LLM Factuality Correction with Retrieval Augmentation [8.207682890286957]
Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs.
This paper introduces a simple but effective low-latency post-correction method, textbfRetrieval Augmented Correction (RAC), aimed at enhancing the factual performance of LLMs without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-10-21T06:11:38Z) - Learning to Rewrite: Generalized LLM-Generated Text Detection [19.9477991969521]
Large language models (LLMs) can be abused at scale to create non-factual content and spread disinformation.
We propose training an LLM to rewrite input text, producing minimal edits for LLM-generated content and more edits for human-written text.
Our work suggests that LLM can effectively detect machine-generated text if they are trained properly.
arXiv Detail & Related papers (2024-08-08T05:53:39Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Large Language Models are Efficient Learners of Noise-Robust Speech
Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR)
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLM) are leveraging human feedback to improve their generation quality.
We propose LLMRefine, an inference time optimization method to refine LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z) - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction.
Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z) - Validating Large Language Models with ReLM [11.552979853457117]
Large language models (LLMs) have been touted for their ability to generate natural-sounding text.
There are growing concerns around possible negative effects of LLMs such as data memorization, bias, and inappropriate language.
We introduce ReLM, a system for validating and querying LLMs using standard regular expressions.
arXiv Detail & Related papers (2022-11-21T21:40:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.