Two Spelling Normalization Approaches Based on Large Language Models
- URL: http://arxiv.org/abs/2506.23288v1
- Date: Sun, 29 Jun 2025 15:25:09 GMT
- Title: Two Spelling Normalization Approaches Based on Large Language Models
- Authors: Miguel Domingo, Francisco Casacuberta
- Abstract summary: Spelling normalization endeavors to align a document's orthography with contemporary standards. We propose two new approaches based on large language models: one trained without supervision, and a second trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both of them yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one trained without supervision, and a second trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both of them yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
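The task described in the abstract can be illustrated with a toy sketch: mapping historical spellings onto their contemporary forms. The paper's systems learn this mapping with large language models or statistical machine translation; the dictionary, function name, and example sentence below are purely illustrative and not taken from the paper.

```python
# Toy illustration of spelling normalization: historical orthography is
# mapped to contemporary standards. Real systems learn this mapping from
# data; this hard-coded dictionary only shows what the task looks like.

HISTORICAL_TO_MODERN = {
    "vntil": "until",
    "haue": "have",
    "shewed": "showed",
    "olde": "old",
}

def normalize(sentence: str) -> str:
    """Replace known historical spellings token by token; keep unknown tokens as-is."""
    tokens = sentence.split()
    return " ".join(HISTORICAL_TO_MODERN.get(tok, tok) for tok in tokens)

print(normalize("they haue shewed the olde booke"))
# -> "they have showed the old booke"  ("booke" is unknown, so it is left unchanged)
```

In practice such a lookup table cannot cope with unseen variants, which is why the paper frames normalization as a translation problem (historical language as source, modern language as target) and compares LLM-based and statistical MT systems on it.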
Related papers
- Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings [0.0]
This research proposes a novel approach for translating puns from English to French by combining state-of-the-art large language models with specialized techniques for wordplay generation. Our methodology's primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary.
arXiv Detail & Related papers (2025-07-09T03:09:14Z)
- Large corpora and large language models: a replicable method for automating grammatical annotation [0.0]
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'. We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data. We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
arXiv Detail & Related papers (2024-11-18T03:29:48Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Semi-Supervised Learning for Bilingual Lexicon Induction [1.8130068086063336]
We consider the problem of aligning two sets of continuous word representations, corresponding to two languages, to a common space in order to infer a bilingual lexicon.
Our experiments on standard benchmarks, inferring dictionaries from English to more than 20 languages, show that our approach consistently outperforms the existing state of the art.
arXiv Detail & Related papers (2024-02-10T19:27:22Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Grammatical Error Correction: A Survey of the State of the Art [15.174807142080187]
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text.
The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks.
arXiv Detail & Related papers (2022-11-09T19:34:38Z)
- Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representations or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, a dataset of Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Summarising Historical Text in Modern Languages [13.886432536330805]
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This task is fundamentally important to historians and digital humanities researchers but has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
arXiv Detail & Related papers (2021-01-26T13:00:07Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- A Call for More Rigor in Unsupervised Cross-lingual Learning [76.6545568416577]
An existing rationale for such research is based on the lack of parallel data for many of the world's languages.
We argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice.
arXiv Detail & Related papers (2020-04-30T17:06:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.