Text normalization for low-resource languages: the case of Ligurian
- URL: http://arxiv.org/abs/2206.07861v2
- Date: Fri, 22 Dec 2023 06:33:04 GMT
- Title: Text normalization for low-resource languages: the case of Ligurian
- Authors: Stefano Lusito and Edoardo Ferrante and Jean Maillard
- Abstract summary: We show that a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization.
We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian.
- Score: 8.27203430509479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text normalization is a crucial technology for low-resource languages that
lack rigid spelling conventions or have undergone multiple spelling
reforms. Low-resource text normalization has so far relied upon hand-crafted
rules, which are perceived to be more data efficient than neural methods. In
this paper we examine the case of text normalization for Ligurian, an
endangered Romance language. We collect 4,394 Ligurian sentences paired with
their normalized versions, as well as the first open source monolingual corpus
for Ligurian. We show that, in spite of the small amounts of data available, a
compact transformer-based model can be trained to achieve very low error rates
by the use of backtranslation and appropriate tokenization.
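The paper's exact training setup is not reproduced here, but the backtranslation idea it relies on can be sketched: a reverse (normalized-to-raw) model generates synthetic unnormalized inputs from the monolingual corpus, producing extra training pairs for the forward normalization model. In this toy sketch the reverse model is faked with an invented spelling-variant table; the rules and example words are illustrative, not the paper's.

```python
# Illustrative sketch of backtranslation for text normalization.
# A real system would train a normalized->raw seq2seq model; here a
# hand-written rule table stands in for it (rules are invented).
VARIANTS = {"ç": "s", "æ": "ae", "ò": "o"}

def reverse_model(normalized: str) -> str:
    """Stand-in for a trained normalized->raw model: apply spelling
    variants so the output resembles unnormalized text."""
    return "".join(VARIANTS.get(ch, ch) for ch in normalized)

def backtranslate(monolingual: list[str]) -> list[tuple[str, str]]:
    """Turn a normalized-only corpus into synthetic (raw, normalized)
    pairs that can be added to the forward model's training data."""
    return [(reverse_model(s), s) for s in monolingual]

pairs = backtranslate(["çittæ", "bòn"])
print(pairs)  # [('sittae', 'çittæ'), ('bon', 'bòn')]
```

In the actual pipeline the reverse model would itself be learned from the parallel data, and the synthetic pairs would be mixed with the 4,394 gold pairs when training the compact transformer.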
Related papers
- Mitigating Translationese in Low-resource Languages: The Storyboard Approach [9.676710061071809]
We propose a novel approach for data collection by leveraging storyboards to elicit more fluent and natural sentences.
Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text.
We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency.
arXiv Detail & Related papers (2024-07-14T10:47:03Z)
- Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR [0.532018200832244]
This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages.
We minimally augment the baseline language model with word unigram counts that are present in a larger text corpus of the target language but absent in the baseline.
We obtain 21.8% (Telugu) and 41.8% (Kannada) relative word error reduction with our proposed method.
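The augmentation step described above can be sketched in a few lines: collect unigram counts for words that occur in the larger target-language corpus but are absent from the baseline language model's vocabulary. The function name and toy data below are hypothetical, not from the paper.

```python
from collections import Counter

# Hypothetical sketch: find words present in a larger corpus of the
# target language but missing from the baseline LM's vocabulary, with
# their unigram counts, so the first decoding pass can hypothesize them.
def missing_unigrams(baseline_vocab: set, corpus: list) -> Counter:
    """Count corpus words absent from the baseline vocabulary."""
    counts = Counter()
    for sentence in corpus:
        for word in sentence.split():
            if word not in baseline_vocab:
                counts[word] += 1
    return counts

vocab = {"the", "cat"}
corpus = ["the cat sat", "a cat sat"]
print(missing_unigrams(vocab, corpus))  # Counter({'sat': 2, 'a': 1})
```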
arXiv Detail & Related papers (2024-03-16T14:34:31Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z)
- A Chat About Boring Problems: Studying GPT-based text normalization [22.64840464909988]
We show the capacity of Large-Language Models for text normalization in few-shot scenarios.
We find LLM-based text normalization achieves error rates around 40% lower than top normalization systems.
We create a new taxonomy of text normalization errors and apply it to results from GPT-3.5-Turbo and GPT-4.0.
arXiv Detail & Related papers (2023-09-23T16:32:59Z)
- The Best of Both Worlds: Combining Human and Machine Translations for Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z)
- Multilingual Simplification of Medical Texts [49.469685530201716]
We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages.
We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses.
Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
arXiv Detail & Related papers (2023-05-21T18:25:07Z)
- An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer [37.0774363352316]
We propose an end-to-end Chinese text normalization model, which accepts Chinese characters as direct input.
We also release the first publicly accessible large-scale dataset for Chinese text normalization.
arXiv Detail & Related papers (2022-03-31T11:19:53Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Normalizing Text using Language Modelling based on Phonetics and String Similarity [0.0]
We propose a new robust model to perform text normalization.
We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form.
Our strategies yield accuracies of 86.7% and 83.2%, which indicates the effectiveness of our system at text normalization.
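The replacement step this entry describes can be illustrated with a string-similarity lookup: each out-of-lexicon word is mapped to its closest lexicon entry. Here `difflib` stands in for the paper's phonetic and string-similarity scoring, and the lexicon and threshold are invented for illustration.

```python
import difflib

# Hypothetical sketch of similarity-based normalization: replace each
# out-of-lexicon word with its closest lexicon entry (difflib stands
# in for the paper's phonetic + string-similarity model).
LEXICON = ["going", "you", "what", "are"]

def normalize(sentence: str, cutoff: float = 0.6) -> str:
    out = []
    for word in sentence.split():
        if word in LEXICON:
            out.append(word)  # already in normalized form
        else:
            match = difflib.get_close_matches(word, LEXICON, n=1, cutoff=cutoff)
            out.append(match[0] if match else word)  # keep word if no close match
    return " ".join(out)

print(normalize("wht are yu goin"))  # what are you going
```

The `cutoff` parameter trades precision for coverage: too low and unrelated words get rewritten, too high and genuine variants are left untouched.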
arXiv Detail & Related papers (2020-06-25T00:42:39Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.