Semantic Sensitive TF-IDF to Determine Word Relevance in Documents
- URL: http://arxiv.org/abs/2001.09896v2
- Date: Mon, 25 Jan 2021 23:52:07 GMT
- Title: Semantic Sensitive TF-IDF to Determine Word Relevance in Documents
- Authors: Amir Jalilifard, Vinicius F. Caridá, Alex F. Mansano, Rogers S.
Cristo, Felipe Penhorate C. da Fonseca
- Abstract summary: We propose STF-IDF, a novel semantic method based on TF-IDF, for scoring word importance of informal documents in a corpus.
Our method decreased the TF-IDF mean error rate by 50%, reaching a mean error of 13.7%, as opposed to 27.2% for the original TF-IDF.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword extraction has received increasing attention as an
important research topic that can lead to advancements in diverse
applications such as document context categorization, text indexing, and
document classification. In this paper we propose STF-IDF, a novel semantic
method based on TF-IDF for scoring word importance in informal documents in a
corpus. A set of nearly four million documents from health-care social media
was collected and used to train a semantic model and obtain word embeddings.
The features of this semantic space were then used to rearrange the original
TF-IDF scores through an iterative procedure, improving the moderate
performance of the algorithm on informal texts. Tested on 200 randomly chosen
documents, the proposed method decreased the TF-IDF mean error rate by 50%,
reaching a mean error of 13.7%, as opposed to 27.2% for the original TF-IDF.
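The baseline the paper improves on can be sketched as follows. This is a generic TF-IDF scorer, not the authors' STF-IDF; the toy documents and whitespace-level tokenization are illustrative, not the paper's health-care corpus:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for each term of each tokenized document.

    TF is the term's relative frequency within a document; IDF is
    log(N / df), where df is the number of documents containing the term.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

docs = [
    ["the", "flu", "causes", "fever"],
    ["fever", "and", "cough"],
    ["the", "cough", "persists"],
]
scores = tf_idf(docs)
# "flu" occurs in only one document, so its IDF (and hence its score)
# exceeds that of the common word "the".
```

STF-IDF then iteratively re-ranks these scores using word-embedding features, which this sketch does not attempt to reproduce.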
Related papers
- Effects of term weighting approach with and without stop words removing
on Arabic text classification [0.9217021281095907]
This study compares the effects of binary and term-frequency feature weighting on text classification when stop words are removed.
For all metrics, term-frequency weighting with stop-word removal outperforms the binary approach.
The data show clearly that, for a given term weighting approach, removing stop words increases classification accuracy.
arXiv Detail & Related papers (2024-02-21T11:31:04Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset [0.5156484100374058]
Term Frequency-Inverse Document Frequency (TF-IDF) is among the most widely used feature weighting methods in text classification and natural language processing (NLP).
We have investigated and analyzed the feature weighting method for text classification on unstructured data.
The proposed model considers two features, N-grams and TF-IDF, on the IMDB movie reviews and Amazon Alexa reviews datasets for sentiment analysis.
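As a minimal sketch of the N-gram features mentioned above (the helper name and the toy review are illustrative, not from the paper), contiguous n-grams can be extracted from a token sequence and then treated as terms for TF-IDF weighting:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = ["this", "movie", "was", "surprisingly", "good"]
bigrams = ngrams(review, 2)
# Each bigram becomes one "term" for TF-IDF weighting, capturing short
# phrases such as ("surprisingly", "good") that unigrams miss.
```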
arXiv Detail & Related papers (2023-08-08T04:27:34Z)
- Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF [0.0]
The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval.
The term frequency - inverse document frequency (TF-IDF) is the most widely used term weighting method to facilitate the search for relevant documents.
The effectiveness of the method is confirmed by experiments on functions such as Cosine, Dice and Jaccard to measure the similarity of text documents for the Kazakh language.
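The three similarity functions named above have standard definitions; a minimal sketch (the sparse-vector representation and toy weights are illustrative, and this omits the paper's synonym extension):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def dice(a, b):
    """Dice coefficient between two term sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def jaccard(a, b):
    """Jaccard index between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy TF-IDF weight vectors for two documents (values are illustrative).
d1 = {"document": 0.1, "similarity": 0.7, "search": 0.4}
d2 = {"document": 0.2, "similarity": 0.5}
```

Cosine operates on the TF-IDF weights themselves, while Dice and Jaccard compare only the sets of terms each document contains.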
arXiv Detail & Related papers (2022-11-22T15:54:41Z)
- Document-Level Relation Extraction with Sentences Importance Estimation and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
arXiv Detail & Related papers (2022-04-27T03:20:07Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- Unsupervised Identification of Relevant Prior Cases [0.0]
We propose different unsupervised approaches to solve the task of identifying relevant precedents to a given query case.
Our approaches include using word embeddings such as word2vec, doc2vec, and sent2vec; computing cosine similarity over TF-IDF vectors; retrieving relevant documents with BM25 scores; and using pre-trained models such as SBERT to find the most similar document.
Based on the comparative analysis, we found that the TF-IDF score multiplied by the BM25 score gives the best result.
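A minimal sketch of that best-performing combination, multiplying a query-level TF-IDF score by an Okapi BM25 score. The corpus and tokenization below are illustrative toy data, not the paper's legal-case collection:

```python
import math
from collections import Counter

def tfidf_score(query, doc, docs):
    """Sum of TF-IDF weights in `doc` for the query terms."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    tf = Counter(doc)
    return sum((tf[t] / len(doc)) * math.log(n / df[t])
               for t in query if t in tf)

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (
            tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def combined_score(query, doc, docs):
    """Rank by the product of the two scores, as the summary reports."""
    return tfidf_score(query, doc, docs) * bm25_score(query, doc, docs)

docs = [["contract", "law", "breach"],
        ["criminal", "appeal", "verdict"],
        ["contract", "dispute"]]
query = ["contract", "breach"]
ranked = sorted(range(len(docs)),
                key=lambda i: -combined_score(query, docs[i], docs))
# The document sharing both query terms ranks first.
```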
arXiv Detail & Related papers (2021-07-19T15:41:49Z)
- Unsupervised Document Embedding via Contrastive Augmentation [48.71917352110245]
We present a contrastive learning approach with data augmentation techniques to learn document representations in an unsupervised manner.
Inspired by recent contrastive self-supervised learning algorithms used for image and NLP pretraining, we hypothesize that a high-quality document embedding should be invariant to diverse paraphrases.
Our method can decrease the classification error rate by up to 6.4% over the SOTA approaches on the document classification task, matching or even surpassing fully-supervised methods.
arXiv Detail & Related papers (2021-03-26T15:48:52Z)
- Extending Neural Keyword Extraction with TF-IDF tagset matching [4.014524824655106]
Keyword extraction is the task of identifying words that best describe a given document; in news portals it serves to link articles on similar topics.
In this work we develop and evaluate our methods on four novel data sets covering less represented, morphologically-rich languages in European news media industry.
arXiv Detail & Related papers (2021-01-31T15:39:17Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.