Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling
- URL: http://arxiv.org/abs/2105.11872v1
- Date: Tue, 25 May 2021 12:15:45 GMT
- Title: Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling
- Authors: Marcin Namysl, Sven Behnke, Joachim Köhler
- Abstract summary: We propose an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text.
To overcome the data sparsity problem, which is exacerbated by imperfect textual input, we learned noisy language model-based embeddings.
Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets.
- Score: 26.27504889360246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances, standard sequence labeling systems often fail when
processing noisy user-generated text or consuming the output of an Optical
Character Recognition (OCR) process. In this paper, we improve the noise-aware
training method by proposing an empirical error generation approach that
employs a sequence-to-sequence model trained to perform translation from
error-free to erroneous text. Using an OCR engine, we generated a large
parallel text corpus for training and produced several real-world noisy
sequence labeling benchmarks for evaluation. Moreover, to overcome the data
sparsity problem, which is exacerbated by imperfect textual input, we
learned noisy language model-based embeddings. Our approach outperformed the
baseline noise generation and error correction techniques on the erroneous
sequence labeling data sets. To facilitate future research on robustness, we
make our code, embeddings, and data conversion scripts publicly available.
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Optimized Tokenization for Transcribed Error Correction [10.297878672883973]
We show that the performance of correction models can be significantly increased by training solely using synthetic data.
Specifically, we show that synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations.
arXiv Detail & Related papers (2023-10-16T12:14:21Z)
- Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models [39.37532848489779]
We propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective that truncates noisy data.
We show that ENT improves generation quality over standard training and previous soft and hard truncation methods.
arXiv Detail & Related papers (2023-10-02T01:30:27Z)
- You Can Generate It Again: Data-to-text Generation with Verification and Correction Prompting [20.89979858757123]
We propose a novel approach that goes beyond traditional one-shot generation methods by introducing a multi-step process.
The observations from the verification step are converted into a specialized error-indication prompt, which instructs the model to regenerate the output.
This procedure enables the model to incorporate feedback from the error-indication prompt, resulting in improved output generation.
arXiv Detail & Related papers (2023-06-28T05:34:25Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Understanding Model Robustness to User-generated Noisy Texts [2.958690090551675]
In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors.
We propose to model the errors statistically from grammatical-error-correction corpora.
arXiv Detail & Related papers (2021-10-14T14:54:52Z)
- Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
arXiv Detail & Related papers (2021-03-12T15:31:34Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)
- NAT: Noise-Aware Training for Robust Neural Sequence Labeling [30.91638109413785]
We propose two Noise-Aware Training (NAT) objectives that improve the robustness of sequence labeling performed on noisy input.
Our data augmentation method trains a neural model using a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to create a noise-invariant latent representation.
Experiments on English and German named entity recognition benchmarks confirmed that NAT consistently improved robustness of popular sequence labeling models.
arXiv Detail & Related papers (2020-05-14T17:30:06Z)
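The NAT data augmentation objective described above — training on a mixture of clean and noisy samples while keeping labels fixed — can be sketched as follows. This is a minimal illustration under assumptions of my own (the function names and the 50/50 mixing default are not from the paper):

```python
import random

def augmented_batch(samples, corrupt_fn, noise_prob=0.5, rng=random):
    """Build a training batch mixing clean and noisy samples.

    `samples` is a list of (tokens, labels) pairs; `corrupt_fn` perturbs a
    single token. Labels are kept unchanged, as in noise-aware training,
    so the model learns to predict the clean labels from noisy input.
    """
    batch = []
    for tokens, labels in samples:
        if rng.random() < noise_prob:
            tokens = [corrupt_fn(t) for t in tokens]
        batch.append((tokens, labels))
    return batch
```

For example, passing an OCR-style corruption function with `noise_prob=0.5` would yield batches in which roughly half the sentences are perturbed, which is the mixture the data augmentation objective trains on.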
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.