KSW: Khmer Stop Word based Dictionary for Keyword Extraction
- URL: http://arxiv.org/abs/2405.17390v1
- Date: Mon, 27 May 2024 17:42:54 GMT
- Title: KSW: Khmer Stop Word based Dictionary for Keyword Extraction
- Authors: Nimol Thuon, Wangrui Zhang, Sada Thuon
- Abstract summary: This paper introduces KSW, a Khmer-specific approach to keyword extraction that leverages a specialized stop word dictionary.
KSW addresses the limited availability of Khmer NLP resources by developing a tailored stop word dictionary and implementing a preprocessing methodology to remove stop words.
Experiments demonstrate that KSW achieves substantial improvements in accuracy and relevance compared to previous methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces KSW, a Khmer-specific approach to keyword extraction that leverages a specialized stop word dictionary. Due to the limited availability of natural language processing resources for the Khmer language, effective keyword extraction has been a significant challenge. KSW addresses this by developing a tailored stop word dictionary and implementing a preprocessing methodology to remove stop words, thereby enhancing the extraction of meaningful keywords. Our experiments demonstrate that KSW achieves substantial improvements in accuracy and relevance compared to previous methods, highlighting its potential to advance Khmer text processing and information retrieval. The KSW resources, including the stop word dictionary, are available at the following GitHub repository: (https://github.com/back-kh/KSWv2-Khmer-Stop-Word-based-Dictionary-for-Keyword-Extraction.git).
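As a rough illustration of the stop-word-based preprocessing described in the abstract, the sketch below loads a stop word list, filters it out of pre-segmented Khmer tokens, and ranks the remaining words by frequency. The file name `khmer_stopwords.txt` and the frequency-based ranking are assumptions for illustration, not the official KSW pipeline.

```python
# Minimal sketch of stop-word-based keyword extraction (illustrative only,
# not the official KSW implementation). Assumes a UTF-8 stop word list with
# one word per line and input text already segmented into Khmer words.
from collections import Counter

def load_stopwords(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def extract_keywords(tokens, stopwords, top_k=10):
    # Preprocessing: drop stop words, then rank the remaining words by frequency.
    content_words = [t for t in tokens if t not in stopwords]
    return [w for w, _ in Counter(content_words).most_common(top_k)]

# Hypothetical usage (file name is an assumption):
# stopwords = load_stopwords("khmer_stopwords.txt")
# keywords = extract_keywords(segmented_tokens, stopwords)
```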
Related papers
- SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting [5.697227044927832]
Keywords are often bounded by a maximum keyword length, which has been largely under-leveraged in prior works.
We introduce a subsequence-level matching scheme to learn audio-text relations at a finer granularity.
The proposed method improves the baseline results on the hard dataset, increasing AUC from $88.52$ to $94.9$ and reducing EER from $18.82$ to $11.1$.
arXiv Detail & Related papers (2024-09-06T01:08:29Z) - Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source pure Python implementation of the Byte Pair Encoding algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
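For context, the core Byte Pair Encoding merge loop can be sketched in a few lines of pure Python; this illustrates the algorithm itself, not the BatchBPE implementation, which batches merges and adds further optimizations.

```python
# Minimal sketch of the Byte Pair Encoding training loop (illustrative only).
from collections import Counter

def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))          # start from raw byte ids
    merges = []
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))    # count adjacent id pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]     # most frequent adjacent pair
        merges.append(best)
        # replace every occurrence of the best pair with a new token id
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids, next_id = out, next_id + 1
    return merges
```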
arXiv Detail & Related papers (2024-08-05T09:37:21Z) - Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval [0.4499833362998489]
Stopwords are commonly used words in a language that are considered to be of little value in determining the meaning or significance of a document.
Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences.
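A rough sketch of how corpus statistics can surface stopword candidates is shown below; it ranks words by term frequency divided by a smoothed IDF, which is one common reading of the TF-IDF idea and not necessarily the paper's exact procedure.

```python
# Illustrative stopword-candidate ranking from corpus statistics
# (an assumption-laden sketch, not the paper's exact method).
import math
from collections import Counter

def stopword_candidates(docs, top_k=50):
    n_docs = len(docs)
    term_freq = Counter(t for doc in docs for t in doc)       # corpus frequency
    doc_freq = Counter(t for doc in docs for t in set(doc))   # document frequency
    # Words that occur very often and in most documents get a low IDF,
    # so a high tf/idf ratio flags them as likely stopwords.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in term_freq}
    score = {t: term_freq[t] / idf[t] for t in term_freq}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```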
arXiv Detail & Related papers (2024-06-16T17:59:05Z) - An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z) - Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic
Interpretability: A Case Study on Othello-GPT [59.245414547751636]
We propose a circuit discovery framework as an alternative to activation patching.
Our framework suffers less from out-of-distribution issues and proves to be more efficient in terms of complexity.
We dig into a small transformer trained on a synthetic task named Othello and find a number of human-understandable fine-grained circuits inside it.
arXiv Detail & Related papers (2024-02-19T15:04:53Z) - Open-vocabulary Keyword-spotting with Adaptive Instance Normalization [18.250276540068047]
We propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters.
We show significant improvements over recent keyword spotting and ASR baselines.
arXiv Detail & Related papers (2023-09-13T13:49:42Z) - Retrieval-Augmented Multilingual Keyphrase Generation with
Retriever-Generator Iterative Training [66.64843711515341]
Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text.
We call attention to a new setting named multilingual keyphrase generation.
We propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data shortage problem in non-English languages.
arXiv Detail & Related papers (2022-05-21T00:45:21Z) - Deep Keyphrase Completion [59.0413813332449]
Keyphrases provide accurate information about document content; they are highly compact, concise, rich in meaning, and widely used for discourse comprehension, organization, and text retrieval.
We propose keyphrase completion (KPC) to generate more keyphrases for a document (e.g., a scientific publication), taking advantage of the document content along with a very limited number of known keyphrases.
We name it deep keyphrase completion (DKPC) since it attempts to capture the deep semantic meaning of the document content together with known keyphrases via a deep learning framework.
arXiv Detail & Related papers (2021-10-29T07:15:35Z) - FRAKE: Fusional Real-time Automatic Keyword Extraction [1.332091725929965]
Keyword extraction is the task of identifying words or phrases that best express the main concepts of a text.
We use a combined approach consisting of two models: graph centrality features and textual features.
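The graph-centrality half of such an approach can be sketched with a co-occurrence graph and PageRank, as below; the fusion with textual features that the entry mentions is omitted, and the windowed co-occurrence graph is an assumption rather than FRAKE's exact construction.

```python
# Sketch of graph-centrality keyword scoring (illustrative; not FRAKE's
# exact model, which also fuses textual features).
import networkx as nx

def centrality_keywords(tokens, window=3, top_k=10):
    g = nx.Graph()
    # connect words that co-occur within a small sliding window
    for i, w in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != w:
                g.add_edge(w, u)
    scores = nx.pagerank(g)  # centrality score per candidate word
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```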
arXiv Detail & Related papers (2021-04-10T18:30:17Z) - BERT for Monolingual and Cross-Lingual Reverse Dictionary [56.8627517256663]
We propose a simple but effective method to make BERT generate the target word for this specific task.
By using multilingual BERT (mBERT), we can efficiently perform cross-lingual reverse dictionary lookup with one subword embedding.
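One way to approximate this idea with off-the-shelf tooling is a fill-mask query against multilingual BERT, as sketched below; the prompt template and the use of the Hugging Face pipeline are assumptions for illustration and may differ from the paper's setup.

```python
# Illustrative reverse-dictionary query via masked word prediction
# (a sketch, not the paper's method; the prompt wording is an assumption).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

definition = "a domesticated animal that barks and is kept as a pet"
# Ask the model to predict the word being defined.
for cand in fill(f"{definition} is called a [MASK].", top_k=5):
    print(cand["token_str"], round(cand["score"], 3))
```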
arXiv Detail & Related papers (2020-09-30T17:00:10Z)