Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search
- URL: http://arxiv.org/abs/2112.08918v1
- Date: Thu, 16 Dec 2021 14:37:41 GMT
- Title: Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search
- Authors: Rina Buoy and Nguonly Taing and Sovisal Chenda
- Abstract summary: Multiple orders of characters and different spelling realizations of words impose a constraint on Khmer word search functionality.
Spelling mistakes are common since robust spellcheckers are not commonly available across the input device platforms.
The proposed solutions include character order normalization, grapheme- and phoneme-based spellcheckers, and a Khmer word semantic model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Search is one of the key functionalities in digital platforms and
applications such as an electronic dictionary, a search engine, and an
e-commerce platform. While the search function in some languages is trivial,
Khmer word search is challenging given its complex writing system. Multiple
orders of characters and different spelling realizations of words impose a
constraint on Khmer word search functionality. Additionally, spelling mistakes
are common since robust spellcheckers are not commonly available across the
input device platforms. These challenges hinder the use of the Khmer language in
search-embedded applications. Moreover, because no WordNet-like lexical
database exists for the Khmer language, it is impossible to establish the semantic
relations between words that would enable semantic search. In this paper, we propose a
set of robust solutions to the above challenges associated with Khmer word
search. The proposed solutions include character order normalization, grapheme
and phoneme-based spellcheckers, and a Khmer word semantic model. The semantic
model is based on the word embedding model that is trained on a 30-million-word
corpus and is used to capture the semantic similarities between words.
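As a rough illustration of the character order normalization mentioned in the abstract, the sketch below reorders the dependent marks that follow each Khmer base character into one fixed order, so that two visually identical spellings typed in different orders map to the same search key. This is not the authors' algorithm: the function name `normalize_order` and the chosen priority (subscript pairs, then consonant shifters, then dependent vowels, then other signs) are simplifying assumptions made here for illustration only.

```python
COENG = "\u17d2"  # Khmer sign coeng, which introduces a subscript consonant


def _is_base(ch):
    # Khmer consonants and independent vowels (U+1780..U+17B3)
    return "\u1780" <= ch <= "\u17b3"


def _is_mark(ch):
    # Dependent vowels and signs (U+17B6..U+17D1); coeng is handled separately
    return "\u17b6" <= ch <= "\u17d1"


def _priority(unit):
    # Assumed ordering, chosen for this sketch only
    first = unit[0]
    if first == COENG:
        return 0  # subscript pair (coeng + consonant)
    if first in "\u17c9\u17ca":
        return 1  # consonant shifters
    if "\u17b6" <= first <= "\u17c5":
        return 2  # dependent vowels
    return 3      # remaining signs


def normalize_order(text):
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        i += 1
        if not _is_base(ch):
            out.append(ch)  # non-Khmer or stray characters pass through unchanged
            continue
        units = []
        while i < n:
            if text[i] == COENG and i + 1 < n and "\u1780" <= text[i + 1] <= "\u17a2":
                units.append(text[i:i + 2])  # keep coeng and its consonant together
                i += 2
            elif _is_mark(text[i]):
                units.append(text[i])
                i += 1
            else:
                break
        units.sort(key=_priority)  # stable sort keeps the order within each class
        out.append(ch + "".join(units))
    return "".join(out)


canonical = "\u179f\u17d2\u179a\u17b8"  # base, subscript pair, vowel
variant = "\u179f\u17b8\u17d2\u179a"    # same glyphs, vowel typed before the subscript
print(normalize_order(variant) == canonical)
```

In a search setting, both the indexed words and the incoming query would be passed through the same normalizer before comparison, so the order in which the user typed the marks no longer affects matching.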
Related papers
- A Survey on Importance of Homophones Spelling Correction Model for Khmer Authors [0.0]
Homophones pose a significant challenge to authors in any language because they are pronounced alike but differ in meaning and spelling.
This research aims to address the difficulties faced by Khmer authors when using homophones in their writing.
arXiv Detail & Related papers (2024-11-11T10:07:03Z)
- Khmer Semantic Search Engine (KSE): Digital Information Access and Document Retrieval [0.0]
Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents.
Even Google does not deliver high accuracy for Khmer content.
This research proposes the first Khmer Semantic Search Engine (KSE), designed to enhance traditional Khmer search methods.
arXiv Detail & Related papers (2024-06-13T16:58:02Z)
- KSW: Khmer Stop Word based Dictionary for Keyword Extraction [0.0]
This paper introduces KSW, a Khmer-specific approach to keyword extraction that leverages a specialized stop word dictionary.
KSW addresses this by developing a tailored stop word dictionary and implementing a preprocessing methodology to remove stop words.
Experiments demonstrate that KSW achieves substantial improvements in accuracy and relevance compared to previous methods.
arXiv Detail & Related papers (2024-05-27T17:42:54Z)
- LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
Top-k KNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance.
There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results.
We develop a novel pseudolabel generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Keyword Embeddings for Query Suggestion [3.7900158137749322]
This paper proposes two novel models for the keyword suggestion task trained on scientific literature.
Our techniques adapt the architecture of Word2Vec and FastText to generate keyword embeddings by leveraging documents' keyword co-occurrence.
We evaluate our proposals against the state-of-the-art word and sentence embedding models showing considerable improvements over the baselines for the tasks.
arXiv Detail & Related papers (2023-01-19T11:13:04Z)
- Semantic Search for Large Scale Clinical Ontologies [63.71950996116403]
We present a deep learning approach to build a search system for large clinical vocabularies.
We propose a Triplet-BERT model and a method that generates training data based on semantic training data.
The model is evaluated on five real benchmark data sets, and the results show that our approach achieves high results on both free-text-to-concept and concept-to-concept searching over large vocabularies.
arXiv Detail & Related papers (2022-01-01T05:15:42Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Quotient Space-Based Keyword Retrieval in Sponsored Search [7.639289301435027]
Synonymous keyword retrieval has become an important problem for sponsored search.
We propose a novel quotient space-based retrieval framework to address this problem.
This method has been successfully implemented in Baidu's online sponsored search system.
arXiv Detail & Related papers (2021-05-26T07:27:54Z)
- Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system.
One popular approach to covering OOVs is to use subword units rather than words.
In this paper we explore different existing methods of this solution on both graph construction and search method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.