Neural Text Normalization for Luxembourgish using Real-Life Variation Data
- URL: http://arxiv.org/abs/2412.09383v2
- Date: Fri, 13 Dec 2024 12:08:28 GMT
- Title: Neural Text Normalization for Luxembourgish using Real-Life Variation Data
- Authors: Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank,
- Abstract summary: We propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures.
We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
- Score: 21.370964546752294
- License:
- Abstract: Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
Related papers
- Ensembling Finetuned Language Models for Text Classification [55.15643209328513]
Finetuning is a common practice across different communities to adapt pretrained models to particular tasks.
ensembles of neural networks are typically used to boost performance and provide reliable uncertainty estimates.
We present a metadataset with predictions from five large finetuned models on six datasets and report results of different ensembling strategies.
arXiv Detail & Related papers (2024-10-25T09:15:54Z) - Context Normalization Layer with Applications [0.1499944454332829]
This study proposes a new normalization technique, called context normalization, for image data.
It adjusts the scaling of features based on the characteristics of each sample, which improves the model's convergence speed and performance.
The effectiveness of context normalization is demonstrated on various datasets, and its performance is compared to other standard normalization techniques.
arXiv Detail & Related papers (2023-03-14T06:38:17Z) - Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5
for Machine Translation [9.736284584478032]
We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited.
While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5.
We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.
arXiv Detail & Related papers (2023-02-28T00:50:19Z) - Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z) - Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT)
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z) - Sequence-to-Sequence Lexical Normalization with Multilingual
Transformers [3.3302293148249125]
Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day to day digital communication.
This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data.
We propose a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem.
arXiv Detail & Related papers (2021-10-06T15:53:20Z) - Consistency Regularization for Cross-Lingual Fine-Tuning [61.08704789561351]
We propose to improve cross-lingual fine-tuning with consistency regularization.
Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations.
Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks.
arXiv Detail & Related papers (2021-06-15T15:35:44Z) - Neural Inverse Text Normalization [11.240669509034298]
We propose an efficient and robust neural solution for inverse text normalization.
We show that this can be easily extended to other languages without the need for a linguistic expert to manually curate them.
A transformer based model infused with pretraining consistently achieves a lower WER across several datasets.
arXiv Detail & Related papers (2021-02-12T07:53:53Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Semi-supervised Formality Style Transfer using Language Model
Discriminator and Mutual Information Maximization [52.867459839641526]
Formality style transfer is the task of converting informal sentences to grammatically-correct formal sentences.
We propose a semi-supervised formality style transfer model that utilizes a language model-based discriminator to maximize the likelihood of the output sentence being formal.
Experiments showed that our model outperformed previous state-of-the-art baselines significantly in terms of both automated metrics and human judgement.
arXiv Detail & Related papers (2020-10-10T21:05:56Z) - Lexical Normalization for Code-switched Data and its Effect on
POS-tagging [8.875272663730868]
We propose three normalization models specifically designed to handle code-switched data.
We introduce novel normalization layers and their corresponding language ID and POS tags for the dataset.
Results show that our CS-tailored normalization models outperform Id-En state of the art and Tr-De monolingual models.
arXiv Detail & Related papers (2020-06-01T18:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.