Method for Determining the Similarity of Text Documents for the Kazakh
language, Taking Into Account Synonyms: Extension to TF-IDF
- URL: http://arxiv.org/abs/2211.12364v1
- Date: Tue, 22 Nov 2022 15:54:41 GMT
- Title: Method for Determining the Similarity of Text Documents for the Kazakh
language, Taking Into Account Synonyms: Extension to TF-IDF
- Authors: Bakhyt Bakiyev
- Abstract summary: The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval.
The term frequency - inverse document frequency (TF-IDF) is the most widely used term weighting method to facilitate the search for relevant documents.
The effectiveness of the method is confirmed by experiments using similarity functions such as Cosine, Dice and Jaccard on Kazakh-language text documents.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of determining the similarity of text documents has received
considerable attention in many areas such as Information Retrieval, Text
Mining, Natural Language Processing (NLP) and Computational Linguistics.
Converting text into numeric vectors is a complex task that involves algorithms
such as tokenization, stopword filtering, stemming, and term weighting. The term
frequency - inverse document frequency (TF-IDF) is the most widely used term
weighting method for facilitating the search for relevant documents. To improve
term weighting, many extensions of TF-IDF have been proposed. In this paper,
another extension of the TF-IDF method is proposed in which synonyms are taken
into account. The effectiveness of the method is confirmed by experiments using
similarity functions such as Cosine, Dice and Jaccard to measure the similarity
of text documents for the Kazakh language.
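The abstract describes the overall pipeline (tokenization, stopword filtering, stemming, TF-IDF weighting, then a vector similarity measure) but not the exact form of the synonym-aware weighting. The sketch below is therefore only a minimal illustration of the idea, assuming that synonyms are collapsed to a canonical term before the usual TF-IDF computation and that the weighted vectors are compared with the Cosine, Dice and Jaccard coefficients in their weighted (vector) forms. The SYNONYMS map, the smoothed IDF variant, and the English placeholder documents are illustrative assumptions, not the paper's lexicon or formula.

```python
import math
from collections import Counter

# Hypothetical synonym map: each surface form points to a canonical term.
# The paper targets Kazakh; English placeholders are used here only to keep
# the example readable, and the actual lexicon and weighting may differ.
SYNONYMS = {
    "big": "large",
    "huge": "large",
    "stream": "river",
}

def normalize(tokens, synonyms=SYNONYMS):
    """Collapse synonyms to a canonical form before term weighting."""
    return [synonyms.get(t, t) for t in tokens]

def tfidf_vectors(token_docs):
    """TF-IDF with a smoothed IDF: tf(t, d) * (log((1 + N) / (1 + df(t))) + 1)."""
    docs = [normalize(d) for d in token_docs]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({
            t: (tf[t] / len(d)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t in tf
        })
    return vectors

def _dot(a, b):
    return sum(a[t] * b[t] for t in set(a) & set(b))

def _norm_sq(a):
    return sum(v * v for v in a.values())

def cosine(a, b):
    den = math.sqrt(_norm_sq(a)) * math.sqrt(_norm_sq(b))
    return _dot(a, b) / den if den else 0.0

def dice(a, b):
    den = _norm_sq(a) + _norm_sq(b)
    return 2.0 * _dot(a, b) / den if den else 0.0

def jaccard(a, b):
    dot = _dot(a, b)
    den = _norm_sq(a) + _norm_sq(b) - dot
    return dot / den if den else 0.0

if __name__ == "__main__":
    docs = [
        "the big house stands near the river".split(),
        "a huge house was built by the stream".split(),
        "the cat sleeps on the roof".split(),
    ]
    v = tfidf_vectors(docs)
    # Without synonym normalization, "big"/"huge" and "river"/"stream" would
    # not overlap at all; with it, documents 0 and 1 share those dimensions.
    print("cosine (d0, d1):", round(cosine(v[0], v[1]), 3))
    print("dice   (d0, d1):", round(dice(v[0], v[1]), 3))
    print("jaccard(d0, d1):", round(jaccard(v[0], v[1]), 3))
```

Collapsing synonyms before counting lets a term and its synonyms reinforce the same vector dimension, so two documents that express the same concept with different words still overlap; the paper's actual scheme for distributing weight over synonyms may differ.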
Related papers
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Summarization-Based Document IDs for Generative Retrieval with Language Models [65.11811787587403]
We introduce summarization-based document IDs, in which each document's ID is composed of an extractive summary or abstractive keyphrases.
We show that using ACID, the proposed summarization-based document IDs, improves top-10 and top-20 recall by 15.6% and 14.4% (relative), respectively.
We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO.
arXiv Detail & Related papers (2023-11-14T23:28:36Z)
- A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset [0.5156484100374058]
Term Frequency-Inverse Document Frequency (TF-IDF) and Natural Language Processing (NLP) are among the most widely used methods for information retrieval and text classification.
We have investigated and analyzed the feature weighting method for text classification on unstructured data.
The proposed model considered two features N-Grams and TF-IDF on IMDB movie reviews and Amazon Alexa reviews dataset for sentiment analysis.
arXiv Detail & Related papers (2023-08-08T04:27:34Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Multilingual Search with Subword TF-IDF [0.0]
Subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics.
XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English.
arXiv Detail & Related papers (2022-09-28T17:49:37Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language [0.5801044612920816]
This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language.
We propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data.
arXiv Detail & Related papers (2021-12-23T12:27:45Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Semantic Sensitive TF-IDF to Determine Word Relevance in Documents [0.0]
We propose STF-IDF, a novel semantic method based on TF-IDF, for scoring word importance of informal documents in a corpus.
Our method decreased the TF-IDF mean error rate by 50%, reaching a mean error of 13.7%, compared to 27.2% for the original TF-IDF.
arXiv Detail & Related papers (2020-01-06T00:23:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.