Dialect Normalization using Large Language Models and Morphological Rules
- URL: http://arxiv.org/abs/2506.08907v1
- Date: Tue, 10 Jun 2025 15:34:34 GMT
- Title: Dialect Normalization using Large Language Models and Morphological Rules
- Authors: Antonios Dimakis, John Pavlopoulos, Antonios Anastasopoulos
- Abstract summary: We introduce a new normalization method that combines rule-based, linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting. We implement our method for Greek dialects and apply it to a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information.
- Score: 23.750564623399253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we address this task by introducing a new normalization method that combines rule-based, linguistically informed transformations with large language models (LLMs) using targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it to a dataset of regional proverbs, evaluating the outputs with human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.
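A minimal sketch of the two-stage pipeline described above, assuming simple regex rewrite rules and a hypothetical `llm_complete` callable for the LLM step; the rules and few-shot pairs are illustrative placeholders, not the paper's actual ones.

```python
import re

# Stage 1: rule-based, linguistically informed rewrites.
# Illustrative dialect-to-standard patterns, NOT the paper's actual rules.
MORPH_RULES = [
    (r"ούσι\b", "ουν"),   # e.g. a dialectal verb ending -> standard -ουν
    (r"τση\b", "της"),    # a dialectal clitic form -> standard της
]

def apply_rules(text: str) -> str:
    for pattern, repl in MORPH_RULES:
        text = re.sub(pattern, repl, text)
    return text

# Stage 2: targeted few-shot prompting of an LLM.
FEW_SHOT = [
    ("<dialectal proverb>", "<standard Greek version>"),  # placeholder pair
]

def build_prompt(text: str) -> str:
    shots = "\n".join(f"Dialect: {d}\nStandard: {s}" for d, s in FEW_SHOT)
    return ("Normalize the Greek dialectal sentence into Standard Modern Greek.\n"
            f"{shots}\nDialect: {text}\nStandard:")

def normalize(text: str, llm_complete) -> str:
    # llm_complete: any callable that sends a prompt to an LLM (assumption).
    return llm_complete(build_prompt(apply_rules(text)))
```

Note that no parallel data is required: the rules encode morphological knowledge, and the few-shot examples steer the LLM's remaining normalization decisions.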
Related papers
- A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge'ez Script [3.5149312379702127]
Homophone normalization is a pre-processing step applied in the Amharic Natural Language Processing literature. We propose a post-inference intervention in which normalization is applied to model predictions instead of training data. Our work contributes to the broader discussion on technology-facilitated language change and calls for more language-aware interventions.
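As a rough illustration of that post-inference intervention: map homophonous Ge'ez characters in the decoded output onto canonical forms, leaving the training data untouched. The character groups below are common Amharic homophone sets chosen for illustration; the paper's exact mapping may differ.

```python
# Illustrative homophone groups (assumption; verify against the paper).
HOMOPHONE_MAP = str.maketrans({
    "ሐ": "ሀ", "ኀ": "ሀ",  # h-series variants -> canonical ሀ
    "ሠ": "ሰ",             # s-series variant -> canonical ሰ
    "ፀ": "ጸ",             # ts'-series variant -> canonical ጸ
    "ዐ": "አ",             # pharyngeal variant -> canonical አ
})

def normalize_prediction(model_output: str) -> str:
    """Apply homophone normalization to a prediction, not to training data."""
    return model_output.translate(HOMOPHONE_MAP)

# e.g. hypothesis = normalize_prediction(decode(source_sentence)),
# where decode is whatever inference function produced the output.
```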
arXiv Detail & Related papers (2025-07-20T22:35:08Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
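A toy sketch of such information-theoretic selection, assuming a small space of deterministic phonotactic hypotheses (the hypothesis names and candidate forms are invented for illustration):

```python
import math

# Hypotheses: name -> judgment function over surface forms (toy grammars).
hypotheses = {
    "no_codas": lambda form: not form.endswith(("p", "t", "k")),
    "anything_goes": lambda form: True,
}
posterior = {h: 1 / len(hypotheses) for h in hypotheses}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_info_gain(form, posterior):
    p_yes = sum(p for h, p in posterior.items() if hypotheses[h](form))
    gain = 0.0
    for answer, p_ans in (("yes", p_yes), ("no", 1 - p_yes)):
        if p_ans == 0:
            continue
        # Posterior update: keep hypotheses consistent with the answer.
        updated = {h: (p if hypotheses[h](form) == (answer == "yes") else 0.0)
                   for h, p in posterior.items()}
        z = sum(updated.values())
        updated = {h: p / z for h, p in updated.items()}
        gain += p_ans * (entropy(posterior) - entropy(updated))
    return gain

candidates = ["ba", "bat", "tak", "ka"]
print(max(candidates, key=lambda f: expected_info_gain(f, posterior)))  # "bat"
```

The policy queries the informant with the form that best separates the remaining hypotheses, which is the source of the sample efficiency noted above.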
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z)
- Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities [36.578851892373365]
Social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.
This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script.
Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated.
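One plausible way to construct such training data is to corrupt clean text with character-level substitutions at several noise rates, yielding (noisy, clean) pairs for a transformer normalizer; the confusion pairs below are placeholders, not the paper's actual noise model.

```python
import random

# Placeholder Perso-Arabic confusion pairs (assumption, for illustration).
CONFUSIONS = {"ی": ["ي", "ى"], "ک": ["ك"], "ه": ["ة"]}

def add_noise(text: str, rate: float, rng: random.Random) -> str:
    """Randomly swap characters for unconventional variants at a given rate."""
    return "".join(
        rng.choice(CONFUSIONS[ch]) if ch in CONFUSIONS and rng.random() < rate
        else ch
        for ch in text
    )

def make_pairs(sentences, rates=(0.1, 0.3, 0.5), seed=0):
    """Build (noisy, clean) training pairs at various noise levels."""
    rng = random.Random(seed)
    return [(add_noise(s, r, rng), s) for s in sentences for r in rates]
```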
arXiv Detail & Related papers (2023-05-25T18:18:42Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change [28.106524698188675]
Recent research has revealed that neural language models at scale suffer from poor temporal generalization capability.
We propose a simple yet effective lexical-level masking strategy to post-train a converged language model.
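A sketch of the stated strategy, assuming a set of words flagged as semantically changed and illustrative masking rates (both are assumptions, not the paper's settings): changed words are masked more aggressively during post-training.

```python
import random

def lexical_mask(tokens, changed_words, mask_token="[MASK]",
                 p_changed=0.3, p_other=0.05, seed=0):
    """Mask semantically changed words at a higher rate than other words."""
    rng = random.Random(seed)
    return [mask_token
            if rng.random() < (p_changed if t in changed_words else p_other)
            else t
            for t in tokens]

# e.g. lexical_mask("the clip went viral overnight".split(), {"viral", "clip"})
```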
arXiv Detail & Related papers (2022-10-31T08:12:41Z)
- Lifelong Learning Natural Language Processing Approach for Multilingual Data Classification [1.3999481573773074]
We propose a lifelong learning-inspired approach, which allows for fake news detection in multiple languages.
We also observe that the models generalize knowledge acquired across the analyzed languages.
arXiv Detail & Related papers (2022-05-25T10:34:04Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units, including syllables and phonemes; a generic sketch of this model class appears below.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
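To make the architecture class concrete, here is a generic phoneme-level LSTM language model in PyTorch; the paper's actual model, unit inventory, and auxiliary objectives may differ.

```python
import torch
import torch.nn as nn

class PhonemeLM(nn.Module):
    """Next-unit prediction over phoneme (or syllable) IDs."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, unit_ids):               # (batch, seq_len)
        h, _ = self.lstm(self.embed(unit_ids))
        return self.proj(h)                    # next-unit logits

# Toy training step on random phoneme-ID sequences.
model = PhonemeLM(vocab_size=50)
batch = torch.randint(0, 50, (8, 20))
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 50),
                                   batch[:, 1:].reshape(-1))
loss.backward()
```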
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE, a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically rich languages; a toy agreement-based scorer is sketched below.
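In the spirit of such a metric (not the authors' implementation), a toy well-formedness score can check morphological agreement over dependency arcs; the rule format and example below are assumptions for illustration.

```python
def agreement_score(tokens, rules):
    """tokens: dicts with 'id', 'head', 'deprel', 'feats' (a dict).
    rules: deprel -> features that must match between head and dependent."""
    by_id = {t["id"]: t for t in tokens}
    checked = satisfied = 0
    for tok in tokens:
        head = by_id.get(tok["head"])
        for feat in rules.get(tok["deprel"], []):
            if head and feat in tok["feats"] and feat in head["feats"]:
                checked += 1
                satisfied += tok["feats"][feat] == head["feats"][feat]
    return satisfied / checked if checked else 1.0

# Toy example: an adjective agreeing with its noun head in gender and number.
tokens = [
    {"id": 1, "head": 2, "deprel": "amod",
     "feats": {"Gender": "Fem", "Number": "Sing"}},
    {"id": 2, "head": 0, "deprel": "root",
     "feats": {"Gender": "Fem", "Number": "Sing"}},
]
print(agreement_score(tokens, {"amod": ["Gender", "Number"]}))  # 1.0
```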
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models; a minimal sketch of the idea follows below.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
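A minimal sketch of the general idea: compose each output word's embedding from its character embeddings, so the output layer's parameter count does not grow with the word vocabulary. This illustrates the concept only, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CompositionalOutput(nn.Module):
    """Output layer whose word embeddings are composed from characters."""
    def __init__(self, n_chars, dim):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.compose = nn.GRU(dim, dim, batch_first=True)

    def word_embeddings(self, char_ids):        # (n_words, max_word_len)
        _, h = self.compose(self.char_emb(char_ids))
        return h.squeeze(0)                     # (n_words, dim)

    def forward(self, hidden, candidate_char_ids):
        # Score hidden states against embeddings composed on the fly,
        # so unseen words need no new parameters.
        return hidden @ self.word_embeddings(candidate_char_ids).T

layer = CompositionalOutput(n_chars=30, dim=16)
hidden = torch.randn(4, 16)                     # decoder states
words = torch.randint(0, 30, (100, 7))          # 100 candidate words as chars
print(layer(hidden, words).shape)               # torch.Size([4, 100])
```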
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- A Hybrid Approach to Dependency Parsing: Combining Rules and Morphology with Deep Learning [0.0]
We propose two approaches to dependency parsing, designed especially for languages with a limited amount of training data.
Our first approach combines a state-of-the-art deep learning-based parser with a rule-based approach, and the second incorporates morphological information into the network, as sketched below.
The proposed methods are developed for Turkish, but can be adapted to other languages as well.
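For the second approach, a hedged sketch of one way morphological information can enter a neural parser (illustrative, not the authors' exact network): morphological-tag embeddings are concatenated with word embeddings before the encoder.

```python
import torch
import torch.nn as nn

class MorphAwareEncoder(nn.Module):
    """Parser encoder over word + morphological-tag representations."""
    def __init__(self, n_words, n_morph_tags, w_dim=100, m_dim=32):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.morph_emb = nn.Embedding(n_morph_tags, m_dim)
        self.encoder = nn.LSTM(w_dim + m_dim, 128,
                               batch_first=True, bidirectional=True)

    def forward(self, word_ids, morph_ids):
        # Concatenate lexical and morphological representations per token.
        x = torch.cat([self.word_emb(word_ids),
                       self.morph_emb(morph_ids)], dim=-1)
        states, _ = self.encoder(x)  # contextual states for arc/label scoring
        return states
```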
arXiv Detail & Related papers (2020-02-24T08:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.