Khmer Spellchecking: A Holistic Approach
- URL: http://arxiv.org/abs/2511.09812v1
- Date: Fri, 14 Nov 2025 01:10:52 GMT
- Title: Khmer Spellchecking: A Holistic Approach
- Authors: Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing
- Abstract summary: This paper proposes a holistic approach to the Khmer spellchecking problem. It integrates Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%, compared to existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.
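The pipeline described in the abstract (skip named entities, look up candidates by pronunciation, rank them with a language model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be already segmented, English toy words stand in for Khmer tokens, and `spellcheck`, `toy_g2p`, `toy_lm_score`, and all data below are hypothetical names and values invented for this example.

```python
from typing import Callable, Dict, List, Set, Tuple


def spellcheck(
    tokens: List[str],
    lexicon: Set[str],
    entities: Set[str],
    g2p: Callable[[str], str],
    lm_score: Callable[[List[str]], float],
) -> List[str]:
    """Replace tokens that are neither lexicon words nor named entities
    with the same-sounding lexicon word that the language model prefers."""
    # Index lexicon words by phoneme string so homophones share one key.
    by_sound: Dict[str, List[str]] = {}
    for word in sorted(lexicon):
        by_sound.setdefault(g2p(word), []).append(word)

    corrected = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in lexicon or tok in entities:
            continue  # known word, or proper noun found by the (assumed) NER step
        candidates = by_sound.get(g2p(tok), [])
        if not candidates:
            continue  # no same-sounding candidate; leave the token unchanged
        # Rank candidates by the score of the whole sequence with each one in place.
        corrected[i] = max(
            candidates,
            key=lambda c: lm_score(corrected[:i] + [c] + corrected[i + 1:]),
        )
    return corrected


# Hypothetical toy resources (English stand-ins, not real Khmer data).
TOY_PHONEMES: Dict[str, str] = {"rite": "rait", "right": "rait", "write": "rait"}
TOY_BIGRAMS: Dict[Tuple[str, str], int] = {("i", "write"): 3, ("write", "letters"): 2}
LEXICON: Set[str] = {"i", "write", "right", "letters", "to"}
ENTITIES: Set[str] = {"sokha"}


def toy_g2p(word: str) -> str:
    # Toy grapheme-to-phoneme lookup; unknown words map to themselves.
    return TOY_PHONEMES.get(word, word)


def toy_lm_score(seq: List[str]) -> float:
    # Toy language model: sum of bigram counts over adjacent token pairs.
    return float(sum(TOY_BIGRAMS.get(pair, 0) for pair in zip(seq, seq[1:])))


result = spellcheck(["i", "rite", "letters", "to", "sokha"],
                    LEXICON, ENTITIES, toy_g2p, toy_lm_score)
print(result)
```

In this toy run the out-of-lexicon token "rite" has two same-sounding candidates, and the bigram scores pick "write", while "sokha" is left alone because it is listed as an entity. A real system would replace each stand-in with a trained segmentation, NER, G2P, and language model.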
Related papers
- Towards Explainable Khmer Polarity Classification [0.0]
This paper proposes an explainable Khmer polarity classifier by fine-tuning an instruction-based reasoning Qwen-3 model. Experimental results show that the fine-tuned model not only predicts labels accurately but also provides reasoning by identifying polarity-related keywords.
arXiv Detail & Related papers (2025-11-12T13:23:47Z)
- Evaluating the Impact of Khmer Font Types on Text Recognition [0.7743559889795233]
Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieve high accuracy, while iSeth First, Bayon, and Dangrek perform poorly. This study underscores the critical importance of font selection in optimizing Khmer text recognition.
arXiv Detail & Related papers (2025-06-30T15:35:51Z)
- Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition [61.601626186678146]
We propose a method which allows corrections of substitution errors to improve the recognition accuracy of challenging words. We show that with this method we get a relative improvement in biased word error rate of up to 8%, while maintaining a competitive overall word error rate.
arXiv Detail & Related papers (2025-06-23T14:42:03Z)
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- A Survey on Importance of Homophones Spelling Correction Model for Khmer Authors [0.0]
Homophones present a significant challenge to authors in any language because they are pronounced alike but differ in meaning and spelling.
This research aims to address the difficulties faced by Khmer authors when using homophones in their writing.
arXiv Detail & Related papers (2024-11-11T10:07:03Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion to improve automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- Correcting Real-Word Spelling Errors: A New Hybrid Approach [1.5469452301122175]
A new hybrid approach is proposed which relies on statistical and syntactic knowledge to detect and correct real-word errors.
The model can prove more practical than some other models, such as the WordNet-based method of Hirst and Budanitsky and the fixed-window-size method of Wilcox-O'Hearn and Hirst.
arXiv Detail & Related papers (2023-02-09T06:03:11Z)
- Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search [0.0]
Multiple orders of characters and different spelling realizations of words impose a constraint on Khmer word search functionality.
Spelling mistakes are common since robust spellcheckers are not widely available across input device platforms.
The proposed solutions include character order normalization, grapheme- and phoneme-based spellcheckers, and a Khmer word semantic model.
arXiv Detail & Related papers (2021-12-16T14:37:41Z)
- On Sampling-Based Training Criteria for Neural Language Modeling [97.35284042981675]
We consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation.
We show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities.
Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim.
arXiv Detail & Related papers (2021-04-21T12:55:52Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.