Noise-Aware Named Entity Recognition for Historical VET Documents
- URL: http://arxiv.org/abs/2601.00488v1
- Date: Thu, 01 Jan 2026 21:43:35 GMT
- Title: Noise-Aware Named Entity Recognition for Historical VET Documents
- Authors: Alexander M. Esser, Jens Dörpinghaus
- Abstract summary: We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Our method is one of the first to recognize multiple entity types in VET documents.
- Score: 45.88028371034407
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper addresses Named Entity Recognition (NER) in the domain of Vocational Education and Training (VET), focusing on historical, digitized documents that suffer from OCR-induced noise. We propose a robust NER approach leveraging Noise-Aware Training (NAT) with synthetically injected OCR errors, transfer learning, and multi-stage fine-tuning. Three complementary strategies (training on noisy, clean, and artificial data) are systematically compared. Our method is one of the first to recognize multiple entity types in VET documents; it is applied to German documents but is transferable to other languages. Experimental results demonstrate that domain-specific and noise-aware fine-tuning substantially increases robustness and accuracy under noisy conditions. We provide publicly available code for reproducible noise-aware NER in domain-specific contexts.
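The core technique named in the abstract, injecting synthetic OCR errors into clean training text while keeping token-level NER labels aligned, can be sketched as follows. The confusion table, error rate, and function names below are illustrative assumptions; the paper's actual noise model is not specified in this summary.

```python
import random

# Common OCR confusion pairs (illustrative only; the paper's actual
# confusion set and error distribution are not given in this summary).
OCR_CONFUSIONS = {
    "rn": "m", "m": "rn", "l": "1", "1": "l",
    "o": "0", "0": "o", "e": "c", "u": "ü",
}

def inject_ocr_noise(tokens, labels, error_rate=0.15, seed=42):
    """Corrupt tokens with synthetic OCR errors. Labels pass through
    unchanged because corruption happens inside tokens, never across
    token boundaries, so the token/label alignment is preserved."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        out = tok
        if rng.random() < error_rate:
            # Apply the first applicable confusion found in the token.
            for src, dst in OCR_CONFUSIONS.items():
                if src in out:
                    out = out.replace(src, dst, 1)
                    break
        noisy.append(out)
    return noisy, labels

# Hypothetical German example in the VET domain:
tokens = ["Lehrling", "bei", "Siemens", "in", "Berlin"]
labels = ["O", "O", "B-ORG", "O", "B-LOC"]
noisy_tokens, noisy_labels = inject_ocr_noise(tokens, labels, error_rate=1.0)
```

Pairing the corrupted tokens with the original labels yields the "artificial data" training condition; mixing it with clean and genuinely noisy data would correspond to the three strategies the abstract compares.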
Related papers
- Learning to Retrieve with Weakened Labels: Robust Training under Label Noise [0.0]
We consider a label weakening approach to generate robust retrieval models in the presence of label noise. Our initial results show that label weakening improves retrieval performance compared to 10 different state-of-the-art loss functions.
arXiv Detail & Related papers (2025-12-15T11:52:13Z) - "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities [59.22329574700317]
Spoken named entity recognition (NER) aims to identify named entities from speech. New named entities appear every day, but annotating spoken NER data for them is costly. We propose a method for generating spoken NER data from a named entity dictionary (NED) to reduce costs.
arXiv Detail & Related papers (2024-12-26T07:43:18Z) - Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z) - Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment [28.491885755907575]
Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions.
arXiv Detail & Related papers (2024-07-25T02:30:40Z) - NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition [3.726602636064681]
We present an analysis that shows that real noise is significantly more challenging than simulated noise.
We show that current state-of-the-art models for noise-robust learning fall far short of their theoretically achievable upper bound.
arXiv Detail & Related papers (2024-05-13T10:20:31Z) - Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z) - Noisy Pair Corrector for Dense Retrieval [59.312376423104055]
We propose a novel approach called Noisy Pair Corrector (NPC).
NPC consists of a detection module and a correction module.
We conduct experiments on text-retrieval benchmarks Natural Question and TriviaQA, code-search benchmarks StaQC and SO-DS.
arXiv Detail & Related papers (2023-11-07T08:27:14Z) - Learning to Correct Noisy Labels for Fine-Grained Entity Typing via Co-Prediction Prompt Tuning [9.885278527023532]
We introduce Co-Prediction Prompt Tuning for noise correction in FET.
We integrate prediction results to recall labeled labels and utilize a differentiated margin to identify inaccurate labels.
Experimental results on three widely-used FET datasets demonstrate that our noise correction approach significantly enhances the quality of training samples.
arXiv Detail & Related papers (2023-10-23T06:04:07Z) - Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
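Several entries above evaluate against word error rate (WER). As a reference point, WER is the word-level Levenshtein distance between hypothesis and reference, normalized by the reference length; the minimal sketch below is a generic textbook formulation, not the evaluation code of any paper listed here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

A perfect transcript gives WER 0.0; one substituted word in a three-word reference gives 1/3. Note that WER can exceed 1.0 when the hypothesis contains many insertions.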
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.