Text normalization for low-resource languages: the case of Ligurian
- URL: http://arxiv.org/abs/2206.07861v2
- Date: Fri, 22 Dec 2023 06:33:04 GMT
- Title: Text normalization for low-resource languages: the case of Ligurian
- Authors: Stefano Lusito and Edoardo Ferrante and Jean Maillard
- Abstract summary: We show that a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization.
We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian.
- Score: 8.27203430509479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text normalization is a crucial technology for low-resource languages which
lack rigid spelling conventions or that have undergone multiple spelling
reforms. Low-resource text normalization has so far relied upon hand-crafted
rules, which are perceived to be more data efficient than neural methods. In
this paper we examine the case of text normalization for Ligurian, an
endangered Romance language. We collect 4,394 Ligurian sentences paired with
their normalized versions, as well as the first open source monolingual corpus
for Ligurian. We show that, in spite of the small amounts of data available, a
compact transformer-based model can be trained to achieve very low error rates
by the use of backtranslation and appropriate tokenization.
Related papers
- Medical Concept Normalization in a Low-Resource Setting [0.0]
I investigate the challenges of medical concept normalization in a low-resource setting.
A dataset consisting of posts from a German medical online forum is annotated with concepts from the Unified Medical Language System.
Experiments demonstrate that multilingual Transformer-based models are able to outperform string similarity methods.
arXiv Detail & Related papers (2024-09-06T10:19:32Z) - Historical German Text Normalization Using Type- and Token-Based Language Modeling [0.0]
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context.
An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model.
arXiv Detail & Related papers (2024-09-04T16:14:05Z) - Mitigating Translationese in Low-resource Languages: The Storyboard Approach [9.676710061071809]
We propose a novel approach for data collection by leveraging storyboards to elicit more fluent and natural sentences.
Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text.
We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency.
arXiv Detail & Related papers (2024-07-14T10:47:03Z) - Zero-shot Sentiment Analysis in Low-Resource Languages Using a
Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z) - Crossing the Threshold: Idiomatic Machine Translation through Retrieval
Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z) - A Chat About Boring Problems: Studying GPT-based text normalization [22.64840464909988]
We show the capacity of Large-Language Models for text normalization in few-shot scenarios.
We find LLM based text normalization to achieve error rates around 40% lower than top normalization systems.
We create a new taxonomy of text normalization errors and apply it to results from GPT-3.5-Turbo and GPT-4.0.
arXiv Detail & Related papers (2023-09-23T16:32:59Z) - The Best of Both Worlds: Combining Human and Machine Translations for
Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z) - Multilingual Simplification of Medical Texts [49.469685530201716]
We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages.
We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses.
Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
arXiv Detail & Related papers (2023-05-21T18:25:07Z) - An End-to-end Chinese Text Normalization Model based on Rule-guided
Flat-Lattice Transformer [37.0774363352316]
We propose an end-to-end Chinese text normalization model, which accepts Chinese characters as direct input.
We also release a first publicly accessible largescale dataset for Chinese text normalization.
arXiv Detail & Related papers (2022-03-31T11:19:53Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.