Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling
- URL: http://arxiv.org/abs/2105.11872v1
- Date: Tue, 25 May 2021 12:15:45 GMT
- Title: Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling
- Authors: Marcin Namysl, Sven Behnke, Joachim Köhler
- Abstract summary: We propose an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text.
To overcome the data sparsity problem, which is exacerbated by imperfect textual input, we learned noisy language model-based embeddings.
Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets.
- Score: 26.27504889360246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances, standard sequence labeling systems often fail when
processing noisy user-generated text or consuming the output of an Optical
Character Recognition (OCR) process. In this paper, we improve the noise-aware
training method by proposing an empirical error generation approach that
employs a sequence-to-sequence model trained to perform translation from
error-free to erroneous text. Using an OCR engine, we generated a large
parallel text corpus for training and produced several real-world noisy
sequence labeling benchmarks for evaluation. Moreover, to overcome the data
sparsity problem, which is exacerbated by imperfect textual input, we
learned noisy language model-based embeddings. Our approach outperformed the
baseline noise generation and error correction techniques on the erroneous
sequence labeling data sets. To facilitate future research on robustness, we
make our code, embeddings, and data conversion scripts publicly available.
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Optimized Tokenization for Transcribed Error Correction [10.297878672883973]
We show that the performance of correction models can be significantly increased by training solely using synthetic data.
Specifically, we show that synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations.
arXiv Detail & Related papers (2023-10-16T12:14:21Z)
- Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models [39.37532848489779]
We propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective that truncates noisy data.
We show that ENT improves generation quality over standard training and previous soft and hard truncation methods.
arXiv Detail & Related papers (2023-10-02T01:30:27Z)
- You Can Generate It Again: Data-to-text Generation with Verification and Correction Prompting [20.89979858757123]
We propose a novel approach that goes beyond traditional one-shot generation methods by introducing a multi-step process.
The observations from the verification step are converted into a specialized error-indication prompt, which instructs the model to regenerate the output.
This procedure enables the model to incorporate feedback from the error-indication prompt, resulting in improved output generation.
arXiv Detail & Related papers (2023-06-28T05:34:25Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Understanding Model Robustness to User-generated Noisy Texts [2.958690090551675]
In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors.
We propose to model the errors statistically from grammatical-error-correction corpora.
arXiv Detail & Related papers (2021-10-14T14:54:52Z)
- Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
arXiv Detail & Related papers (2021-03-12T15:31:34Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets.
There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected.
We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)
- NAT: Noise-Aware Training for Robust Neural Sequence Labeling [30.91638109413785]
We propose two Noise-Aware Training (NAT) objectives that improve the robustness of sequence labeling performed on noisy input.
Our data augmentation method trains a neural model using a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to create a noise-invariant latent representation.
Experiments on English and German named entity recognition benchmarks confirmed that NAT consistently improved robustness of popular sequence labeling models.
arXiv Detail & Related papers (2020-05-14T17:30:06Z)
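The NAT data augmentation objective described above — training on a mixture of clean and noisy samples while keeping labels fixed — can be sketched as follows. This is a minimal illustration under assumptions of my own (the function names and the 50/50 mixing default are not from the paper):

```python
import random

def augmented_batch(samples, corrupt_fn, noise_prob=0.5, rng=random):
    """Build a training batch mixing clean and noisy samples.

    `samples` is a list of (tokens, labels) pairs; `corrupt_fn` perturbs a
    single token. Labels are kept unchanged, as in noise-aware training,
    so the model learns to predict the clean labels from noisy input.
    """
    batch = []
    for tokens, labels in samples:
        if rng.random() < noise_prob:
            tokens = [corrupt_fn(t) for t in tokens]
        batch.append((tokens, labels))
    return batch
```

For example, passing an OCR-style corruption function with `noise_prob=0.5` would yield batches in which roughly half the sentences are perturbed, which is the mixture the data augmentation objective trains on.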
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.