Understanding Model Robustness to User-generated Noisy Texts
- URL: http://arxiv.org/abs/2110.07428v1
- Date: Thu, 14 Oct 2021 14:54:52 GMT
- Title: Understanding Model Robustness to User-generated Noisy Texts
- Authors: Jakub Náplava, Martin Popel, Milan Straka, Jana Straková
- Abstract summary: In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors.
We propose to model the errors statistically from grammatical-error-correction corpora.
- Score: 2.958690090551675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sensitivity of deep-neural models to input noise is known to be a challenging
problem. In NLP, model performance often deteriorates with naturally occurring
noise, such as spelling errors. To mitigate this issue, models may leverage
artificially noised data. However, the amount and type of generated noise have
so far been determined arbitrarily. We therefore propose to model the errors
statistically from grammatical-error-correction corpora. We present a thorough
evaluation of several state-of-the-art NLP systems' robustness in multiple
languages, with tasks including morpho-syntactic analysis, named entity
recognition, neural machine translation, a subset of the GLUE benchmark and
reading comprehension. We also compare two approaches to address the
performance drop: a) training the NLP models with noised data generated by our
framework; and b) reducing the input noise with an external system for natural
language correction. The code is released at https://github.com/ufal/kazitext.
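For intuition, the core idea can be sketched in a few lines: estimate how often characters are corrupted from aligned (corrected, noisy) pairs in a grammatical-error-correction corpus, then corrupt clean text at that estimated rate. This is a minimal illustration, not the released kazitext implementation; the alignment and corruption operations below are simplifying assumptions.

```python
# Minimal sketch (not the authors' kazitext code): estimate a character-level
# error rate from aligned GEC pairs and inject noise at that rate.
import difflib
import random

def estimate_error_rate(pairs):
    """Estimate P(character corrupted) from (corrected, noisy) sentence pairs."""
    changed = total = 0
    for corrected, noisy in pairs:
        matcher = difflib.SequenceMatcher(a=corrected, b=noisy)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            total += i2 - i1
            if op != "equal":
                changed += max(i2 - i1, j2 - j1)
    return changed / max(total, 1)

def add_noise(text, error_rate, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Corrupt each character with the estimated probability."""
    out = []
    for ch in text:
        if random.random() < error_rate:
            # substitution, deletion, or doubling, chosen uniformly
            out.append(random.choice([random.choice(alphabet), "", ch + ch]))
        else:
            out.append(ch)
    return "".join(out)

pairs = [("the cat sat", "teh cat sat"), ("hello world", "helo world")]
print(add_noise("model robustness to noise", estimate_error_rate(pairs)))
```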
Related papers
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a suitable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list.
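A hedged sketch of this setup: present the N-best list to an LLM via a prompt and ask for the corrected transcription. The `call_llm` function named in the comment is a hypothetical stand-in for any completion API.

```python
# Build a correction prompt from ASR N-best hypotheses, in the spirit of
# LLM-based error correction over HP-style N-best lists.
def build_correction_prompt(nbest):
    hyps = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return (
        "Below are N-best hypotheses from a speech recognizer. "
        "Output the most likely true transcription, fixing any errors:\n"
        f"{hyps}\nTranscription:"
    )

nbest = ["i red the book yesterday",
         "i read the book yesterday",
         "i read a book yesterday"]
prompt = build_correction_prompt(nbest)
print(prompt)  # pass to a hypothetical call_llm(prompt) with your LLM of choice
```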
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Improving the Robustness of Summarization Models by Detecting and Removing Input Noise [50.27105057899601]
We present a large empirical study quantifying the sometimes severe loss in performance from different types of input noise for a range of datasets and model sizes.
We propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any training, auxiliary models, or even prior knowledge of the type of noise.
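One way to picture such a training-free filter is below; the scoring heuristic (ratio of non-alphabetic junk characters per sentence) is an illustrative assumption, not the paper's actual detector.

```python
# Inference-time noise filtering sketch: score sentences for junk characters
# and drop high-scoring spans before summarization.
import re

def noise_score(sentence):
    """Fraction of characters that are neither letters nor common punctuation."""
    if not sentence:
        return 1.0
    junk = len(re.findall(r"[^a-zA-Z\s.,;:'\"!?-]", sentence))
    return junk / len(sentence)

def remove_noisy_spans(document, threshold=0.3):
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return " ".join(s for s in sentences if noise_score(s) < threshold)

doc = "The report was filed Tuesday. @@##%%&&**((~~!! Sales rose 4%."
print(remove_noisy_spans(doc))  # junk span is dropped, real sentences kept
```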
arXiv Detail & Related papers (2022-12-20T00:33:11Z)
- Denoising Enhanced Distantly Supervised Ultrafine Entity Typing [36.14308856513851]
We build a noise model to estimate the unknown labeling noise distribution over input contexts and noisy type labels.
With the noise model, more trustworthy labels can be recovered by subtracting the estimated noise from the input.
We propose an entity typing model that adopts a bi-encoder architecture and is trained on the denoised data.
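The denoising step can be pictured as subtraction in label space; in this toy sketch the learned noise model is replaced by random estimates, and all shapes and thresholds are assumptions for demonstration.

```python
# Toy illustration of "subtract estimated noise from the observed labels".
import numpy as np

rng = np.random.default_rng(0)
observed = np.array([1.0, 0.0, 1.0, 1.0, 0.0])        # noisy multi-label targets
estimated_noise = rng.uniform(0, 0.6, observed.size)  # stand-in for a learned noise model

denoised = np.clip(observed - estimated_noise, 0.0, 1.0)
trusted = denoised > 0.5   # keep only labels still supported after denoising
print("denoised soft labels:", denoised.round(2))
print("trusted labels:", trusted)
```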
arXiv Detail & Related papers (2022-10-18T05:20:16Z)
- Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining [14.087882550564169]
Existing assessments of neural models' robustness to noisy data, and the improvements they suggest, are limited to English.
To benchmark the performance of pretrained multilingual models, we construct noisy datasets covering five languages and four NLP tasks.
We propose Robust Contrastive Pretraining (RCP) to boost the zero-shot cross-lingual robustness of multilingual pretrained models.
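The contrastive objective can be sketched as an in-batch InfoNCE loss over clean/noised pairs; the random embeddings below are a toy stand-in for a pretrained multilingual encoder.

```python
# InfoNCE sketch: a clean sentence and its noised version form a positive
# pair; other sentences in the batch serve as negatives.
import torch
import torch.nn.functional as F

def info_nce(clean_emb, noisy_emb, temperature=0.1):
    clean = F.normalize(clean_emb, dim=-1)
    noisy = F.normalize(noisy_emb, dim=-1)
    logits = clean @ noisy.t() / temperature   # [B, B] similarity matrix
    targets = torch.arange(clean.size(0))      # diagonal entries are positives
    return F.cross_entropy(logits, targets)

batch, dim = 8, 128
clean_emb = torch.randn(batch, dim)                    # encoder(clean_batch)
noisy_emb = clean_emb + 0.1 * torch.randn(batch, dim)  # encoder(noised_batch)
print(info_nce(clean_emb, noisy_emb).item())
```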
arXiv Detail & Related papers (2022-10-10T15:40:43Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
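A minimal sketch of the idea, with a small feed-forward head standing in for a fine-tuned transformer layer and an assumed regularization weight:

```python
# Noise stability sketch: perturb a hidden representation with standard
# Gaussian noise and penalize the change in the model's output.
import torch
import torch.nn as nn
import torch.nn.functional as F

head = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 2))
hidden = torch.randn(16, 128)             # hidden states from a lower layer
labels = torch.randint(0, 2, (16,))

noise = torch.randn_like(hidden)          # standard Gaussian noise
task_loss = F.cross_entropy(head(hidden), labels)
stability = F.mse_loss(head(hidden + noise), head(hidden))
loss = task_loss + 0.1 * stability        # 0.1 is an assumed trade-off weight
loss.backward()
```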
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
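One plausible reading of the objective, sketched on a toy classifier: predictions on generated noise samples are pushed toward the uniform (maximum-entropy) distribution while the task loss is applied to real data. The noise generator and loss weighting are assumptions.

```python
# Entropy-regularisation sketch: penalize confident predictions on noise.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(128, 4)                  # toy classifier head
real, labels = torch.randn(16, 128), torch.randint(0, 4, (16,))
noise = 3.0 * torch.randn(16, 128)               # assumed noise generator

task_loss = F.cross_entropy(model(real), labels)
log_probs = F.log_softmax(model(noise), dim=-1)
uniform = torch.full_like(log_probs, 1.0 / 4)    # maximum-entropy target
entropy_reg = F.kl_div(log_probs, uniform, reduction="batchmean")
(task_loss + entropy_reg).backward()             # equal weighting assumed
```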
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
- Evaluating the Robustness of Neural Language Models to Input Perturbations [7.064032374579076]
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate noisy input texts.
We investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo in handling different types of input perturbations.
The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced.
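Typical character-level perturbations of this kind can be sketched as follows; the specific operations and rates are illustrative choices, not the paper's exact protocol.

```python
# Character-level perturbations (swap, delete, repeat) applied to a random
# subset of words, simulating noisy user input.
import random

def char_perturb(word, rng):
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)
    op = rng.choice(["swap", "delete", "repeat"])
    if op == "swap":                          # transpose adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":                        # drop one character
        return word[:i] + word[i + 1:]
    return word[:i] + word[i] + word[i:]      # double one character

def word_perturb(tokens, rate, rng):
    return [char_perturb(t, rng) if rng.random() < rate else t for t in tokens]

rng = random.Random(42)
tokens = "language models are sensitive to small input changes".split()
print(" ".join(word_perturb(tokens, rate=0.3, rng=rng)))
```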
arXiv Detail & Related papers (2021-08-27T12:31:17Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on real-world (noisy) corpora but also enhances robustness, producing high-quality results in noisy environments.
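A rough sketch of the embedding-alignment intuition, with a toy encoder and a simple mean-matching penalty standing in for the paper's actual method:

```python
# Pull clean (training) and noisy (real-world) utterance embeddings toward a
# similar vector space by penalizing the gap between their batch means.
import torch

encoder = torch.nn.Linear(300, 64)           # toy utterance encoder
clean = torch.randn(32, 300)                 # clean training utterances
noisy = clean + 0.5 * torch.randn(32, 300)   # simulated noisy counterparts

clean_emb, noisy_emb = encoder(clean), encoder(noisy)
align_loss = (clean_emb.mean(0) - noisy_emb.mean(0)).pow(2).sum()
align_loss.backward()                        # added to the SLU task loss
```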
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Word Shape Matters: Robust Machine Translation with Visual Embedding [78.96234298075389]
We introduce a new encoding of the input symbols for character-level NLP models.
It encodes the shape of each character through the images depicting the letters when printed.
We name this new strategy visual embedding; it is expected to improve the robustness of NLP models.
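A minimal sketch of the visual-embedding idea: render each character to a small grayscale bitmap and use the flattened pixels as its vector, so visually confusable characters land near each other. The font and rendering details are assumptions.

```python
# Render characters to bitmaps and compare the resulting pixel embeddings.
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def visual_embedding(char, size=16):
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), char, fill=255,
                             font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

print(cosine(visual_embedding("l"), visual_embedding("1")))  # similar shapes
print(cosine(visual_embedding("l"), visual_embedding("x")))  # dissimilar shapes
```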
arXiv Detail & Related papers (2020-10-20T04:08:03Z)
- NAT: Noise-Aware Training for Robust Neural Sequence Labeling [30.91638109413785]
We propose two Noise-Aware Training (NAT) objectives that improve the robustness of sequence labeling performed on perturbed input.
Our data augmentation method trains a neural model using a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to create a noise-invariant latent representation.
Experiments on English and German named entity recognition benchmarks confirmed that NAT consistently improved robustness of popular sequence labeling models.
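Both objectives can be sketched on a toy tagger: an augmentation loss over a mixture of clean and noisy inputs, plus a stability term penalizing divergence between the two predictions. The linear tagger and noise process are stand-ins for a real model and perturbation scheme.

```python
# NAT-style sketch: (a) train on clean and noisy inputs, (b) encourage
# noise-invariant predictions via a KL stability term.
import torch
import torch.nn.functional as F

tagger = torch.nn.Linear(64, 5)              # toy per-token tagger
clean = torch.randn(32, 64)                  # token representations
noisy = clean + 0.2 * torch.randn(32, 64)    # perturbed counterparts
tags = torch.randint(0, 5, (32,))

aug_loss = (F.cross_entropy(tagger(clean), tags)
            + F.cross_entropy(tagger(noisy), tags))          # (a) augmentation
stability = F.kl_div(F.log_softmax(tagger(noisy), dim=-1),
                     F.softmax(tagger(clean), dim=-1),
                     reduction="batchmean")                  # (b) invariance
(aug_loss + stability).backward()
```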
arXiv Detail & Related papers (2020-05-14T17:30:06Z)
- Contextual Text Denoising with Masked Language Models [21.923035129334373]
We propose a new contextual text denoising algorithm based on the ready-to-use masked language model.
The proposed algorithm does not require retraining of the model and can be integrated into any NLP system.
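A hedged sketch of the procedure using the Hugging Face fill-mask pipeline: mask a suspicious token, collect the masked language model's candidates, and prefer the one closest in spelling to the original noisy token. The edit-similarity tie-breaking is an illustrative heuristic.

```python
# Mask-and-replace denoising with a ready-to-use masked language model.
import difflib
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def denoise_token(tokens, idx):
    masked = list(tokens)
    masked[idx] = fill.tokenizer.mask_token
    candidates = fill(" ".join(masked))      # top-k MLM predictions
    best = max(candidates, key=lambda c: difflib.SequenceMatcher(
        a=tokens[idx], b=c["token_str"]).ratio())
    return best["token_str"]

tokens = "she boutgh a new car".split()
print(denoise_token(tokens, 1))              # likely "bought"
```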
arXiv Detail & Related papers (2019-10-30T18:47:37Z)