Khmer Semantic Search Engine (KSE): Digital Information Access and Document Retrieval
- URL: http://arxiv.org/abs/2406.09320v2
- Date: Sun, 16 Jun 2024 19:08:34 GMT
- Title: Khmer Semantic Search Engine (KSE): Digital Information Access and Document Retrieval
- Authors: Nimol Thuon,
- Abstract summary: Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents.
Even Google does not deliver high accuracy for Khmer content.
This research proposes the first Khmer Semantic Search Engine (KSE), designed to enhance traditional Khmer search methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The search engine process is crucial for document content retrieval. For Khmer documents, an effective tool is needed to extract essential keywords and facilitate accurate searches. Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents due to the lack of an effective semantic searching tool. Even Google does not deliver high accuracy for Khmer content. Semantic search engines improve search results by employing advanced algorithms to understand various content types. With the rise in Khmer digital content such as reports, articles, and social media feedback enhanced search capabilities are essential. This research proposes the first Khmer Semantic Search Engine (KSE), designed to enhance traditional Khmer search methods. Utilizing semantic matching techniques and formally annotated semantic content, our tool extracts meaningful keywords from user queries, performs precise matching, and provides the best matching offline documents and online URLs. We propose three semantic search frameworks: semantic search based on a keyword dictionary, semantic search based on ontology, and semantic search based on ranking. Additionally, we developed tools for data preparation, including document addition and manual keyword extraction. To evaluate performance, we created a ground truth dataset and addressed issues related to searching and semantic search. Our findings demonstrate that understanding search term semantics can lead to significantly more accurate results.
Related papers
- PseudoSeer: a Search Engine for Pseudocode [18.726136894285403]
A novel pseudocode search engine is designed to facilitate efficient retrieval and search of academic papers containing pseudocode.
By leveraging snippets, the system enables users to search across various facets of a paper, such as the title, abstract, author information, and code snippets.
A weighted BM25-based ranking algorithm is used by the search engine, and factors considered when prioritizing search results are described.
arXiv Detail & Related papers (2024-11-19T16:58:03Z) - Taxonomy-guided Semantic Indexing for Academic Paper Search [51.07749719327668]
TaxoIndex is a semantic index framework for academic paper search.
It organizes key concepts from papers as a semantic index guided by an academic taxonomy.
It can be flexibly employed to enhance existing dense retrievers.
arXiv Detail & Related papers (2024-10-25T00:00:17Z) - VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and
Optimized Search [1.0411820336052784]
We propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval.
By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy.
Experiments on real-world datasets show that VectorSearch outperforms baseline metrics.
arXiv Detail & Related papers (2024-09-25T21:58:08Z) - Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language [0.0]
This paper endeavors to establish a straightforward yet potent benchmark for semantic search in Arabic.
To precisely evaluate the effectiveness of these metrics and the dataset, we conduct our assessment of semantic search within the framework of retrieval augmented generation (RAG)
arXiv Detail & Related papers (2024-03-27T08:42:31Z) - LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
List K-kNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance.
There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results.
We develop a novel pseudolabel generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multi concepts for multilingual semantic matching to liberate the model from the reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - Khmer Word Search: Challenges, Solutions, and Semantic-Aware Search [0.0]
Multiple orders of characters and different spelling realizations of words impose a constraint on Khmer word search functionality.
Spelling mistakes are common since robust spellcheckers are not commonly available across the input device platforms.
The proposed solutions include character order normalization, grapheme and phoneme-based spellcheckers, and Khmer word semantic model.
arXiv Detail & Related papers (2021-12-16T14:37:41Z) - Deep Keyphrase Completion [59.0413813332449]
Keyphrase provides accurate information of document content that is highly compact, concise, full of meanings, and widely used for discourse comprehension, organization, and text retrieval.
We propose textitkeyphrase completion (KPC) to generate more keyphrases for document (e.g. scientific publication) taking advantage of document content along with a very limited number of known keyphrases.
We name it textitdeep keyphrase completion (DKPC) since it attempts to capture the deep semantic meaning of the document content together with known keyphrases via a deep learning framework
arXiv Detail & Related papers (2021-10-29T07:15:35Z) - Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries, as well as conducting an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z) - Neural Extractive Search [53.15076679818303]
Domain experts often need to extract structured information from large corpora.
We advocate for a search paradigm called extractive search'', in which a search query is enriched with capture-slots.
We show how the recall can be improved using neural retrieval and alignment.
arXiv Detail & Related papers (2021-06-08T18:03:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.