Historical German Text Normalization Using Type- and Token-Based Language Modeling
- URL: http://arxiv.org/abs/2409.02841v1
- Date: Wed, 4 Sep 2024 16:14:05 GMT
- Title: Historical German Text Normalization Using Type- and Token-Based Language Modeling
- Authors: Anton Ehrmanntraut
- Abstract summary: This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context.
An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Historic variations of spelling pose a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic orthographic normalization of the historical source material is pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to difficulties for models to generalize, and the lack of extensive high-quality parallel data.
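The combination of type-level and token-level modeling described in the abstract can be illustrated with a small sketch. This is not the author's implementation: the candidate table `TYPE_CANDIDATES` is a hypothetical stand-in for the encoder-decoder type normalizer, the bigram table `BIGRAM_LOGPROB` is a hypothetical stand-in for the pre-trained causal language model, all probabilities are invented for illustration, and the exhaustive search is only practical for very short inputs.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical type-level candidates: historical word type -> [(normalization, log-prob)].
# A toy table stands in for the encoder-decoder model; the numbers are made up.
TYPE_CANDIDATES: Dict[str, List[Tuple[str, float]]] = {
    "wider": [("wieder", math.log(0.7)), ("wider", math.log(0.3))],
    "thun": [("tun", 0.0)],
    "seyn": [("sein", 0.0)],
}

# Hypothetical bigram log-probs standing in for a causal language model.
BIGRAM_LOGPROB: Dict[Tuple[str, str], float] = {
    ("kam", "wieder"): math.log(0.2),
    ("kam", "wider"): math.log(0.001),
    ("wider", "willen"): math.log(0.3),
    ("wieder", "willen"): math.log(0.001),
}
UNSEEN_LOGPROB = math.log(1e-4)  # flat score for bigrams missing from the toy table


def type_candidates(word: str) -> List[Tuple[str, float]]:
    """Return candidate normalizations for one word type (identity if unknown)."""
    return TYPE_CANDIDATES.get(word, [(word, 0.0)])


def lm_logprob(words: Sequence[str]) -> float:
    """Score a candidate sentence left to right with the toy bigram model."""
    total, prev = 0.0, "<s>"
    for word in words:
        total += BIGRAM_LOGPROB.get((prev, word), UNSEEN_LOGPROB)
        prev = word
    return total


def normalize(tokens: Sequence[str], lm_weight: float = 1.0) -> List[str]:
    """Pick the candidate combination with the best type score plus weighted LM score."""
    options = [type_candidates(tok) for tok in tokens]
    best_words: List[str] = []
    best_score = -math.inf
    # Exhaustive search over all combinations: fine for a sketch, exponential in general.
    for combo in itertools.product(*options):
        words = [word for word, _ in combo]
        score = sum(lp for _, lp in combo) + lm_weight * lm_logprob(words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words


print(normalize(["er", "kam", "wider"]))
print(normalize(["wider", "willen", "thun"]))
```

In this toy setup the type-level table alone prefers "wieder" for the historical spelling "wider", while the context score can override that choice when the neighbouring words favour the preposition "wider". Setting `lm_weight=0.0` disables the context adjustment and falls back to the type-level choice.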
Related papers
- Is text normalization relevant for classifying medieval charters? [0.0]
This study examines the impact of historical text normalization on the classification of medieval charters.
Our results indicate that the given normalization minimally improves locating tasks but reduces accuracy for dating.
Results suggest a selective approach to historical text normalization, emphasizing the significance of preserving some textual characteristics.
arXiv Detail & Related papers (2024-08-29T11:19:57Z)
- Neural machine translation for automated feedback on children's early-stage writing [3.0695550123017514]
We address the problem of assessing and constructing feedback for early-stage writing automatically using machine learning.
We propose to use sequence-to-sequence models for "translating" early-stage writing by students into "conventional" writing.
arXiv Detail & Related papers (2023-11-15T21:32:44Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction and Constrained Decoding [65.34601470417967]
We describe a hybrid architecture for dialogue response generation that combines the strengths of neural language modeling and rule-based generation.
Our experiments show that this system outperforms both rule-based and learned approaches in human evaluations of fluency, relevance, and truthfulness.
arXiv Detail & Related papers (2022-09-16T09:00:49Z)
- Text normalization for low-resource languages: the case of Ligurian [8.27203430509479]
We show that a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization.
We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian.
arXiv Detail & Related papers (2022-06-16T00:37:55Z)
- Sequence-to-Sequence Lexical Normalization with Multilingual Transformers [3.3302293148249125]
Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication.
This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data.
We propose a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem.
arXiv Detail & Related papers (2021-10-06T15:53:20Z)
- GTAE: Graph-Transformer based Auto-Encoders for Linguistic-Constrained Text Style Transfer [119.70961704127157]
Non-parallel text style transfer has attracted increasing research interest in recent years.
Current approaches still lack the ability to preserve the content and even logic of original sentences.
We propose a method called Graph Transformer based Auto-Encoder (GTAE), which models a sentence as a linguistic graph and performs feature extraction and style transfer at the graph level.
arXiv Detail & Related papers (2021-02-01T11:08:45Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction starting from almost no training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Normalizing Text using Language Modelling based on Phonetics and String Similarity [0.0]
We propose a new robust model to perform text normalization.
We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form.
Our strategies yield accuracies of 86.7% and 83.2%, which indicate the effectiveness of our system in dealing with text normalization.
arXiv Detail & Related papers (2020-06-25T00:42:39Z)