Semantic Sensitive TF-IDF to Determine Word Relevance in Documents
- URL: http://arxiv.org/abs/2001.09896v2
- Date: Mon, 25 Jan 2021 23:52:07 GMT
- Title: Semantic Sensitive TF-IDF to Determine Word Relevance in Documents
- Authors: Amir Jalilifard, Vinicius F. Caridá, Alex F. Mansano, Rogers S.
Cristo, Felipe Penhorate C. da Fonseca
- Abstract summary: We propose STF-IDF, a novel semantic method based on TF-IDF, for scoring word importance of informal documents in a corpus.
Our method decreased the TF-IDF mean error rate by 50%, reaching a mean error of 13.7%, as opposed to 27.2% for the original TF-IDF.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyword extraction has received increasing attention as an
important research topic that can lead to advancements in diverse
applications such as document context categorization, text indexing, and
document classification. In this paper we propose STF-IDF, a novel semantic
method based on TF-IDF for scoring word importance in informal documents in a
corpus. A set of nearly four million documents from health-care social media
was collected and used to train a semantic model and obtain word embeddings.
The features of this semantic space were then used to rearrange the original
TF-IDF scores through an iterative procedure, improving the moderate
performance of the algorithm on informal texts. Tested on 200 randomly chosen
documents, the proposed method decreased the TF-IDF mean error rate by 50%,
reaching a mean error of 13.7%, as opposed to 27.2% for the original TF-IDF.
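The baseline the paper improves on can be sketched as follows. This is a generic TF-IDF scorer, not the authors' STF-IDF; the toy documents and whitespace-level tokenization are illustrative, not the paper's health-care corpus:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for each term of each tokenized document.

    TF is the term's relative frequency within a document; IDF is
    log(N / df), where df is the number of documents containing the term.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

docs = [
    ["the", "flu", "causes", "fever"],
    ["fever", "and", "cough"],
    ["the", "cough", "persists"],
]
scores = tf_idf(docs)
# "flu" occurs in only one document, so its IDF (and hence its score)
# exceeds that of the common word "the".
```

STF-IDF then iteratively re-ranks these scores using word-embedding features, which this sketch does not attempt to reproduce.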
Related papers
- Effects of term weighting approach with and without stop words removing
on Arabic text classification [0.9217021281095907]
This study compares the effects of binary and term-frequency feature weighting on text classification when stop words are removed.
For all metrics, term-frequency weighting with stop-word removal outperforms the binary approach.
The data show clearly that, for a given term weighting approach, removing stop words increases classification accuracy.
arXiv Detail & Related papers (2024-02-21T11:31:04Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- A Comparative Study on TF-IDF feature Weighting Method and its Analysis using Unstructured Dataset [0.5156484100374058]
Term Frequency-Inverse Document Frequency (TF-IDF) is among the most widely used feature weighting methods in text classification and natural language processing (NLP).
We have investigated and analyzed the feature weighting method for text classification on unstructured data.
The proposed model considers two features, N-grams and TF-IDF, on the IMDB movie reviews and Amazon Alexa reviews datasets for sentiment analysis.
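As a minimal sketch of the N-gram features mentioned above (the helper name and the toy review are illustrative, not from the paper), contiguous n-grams can be extracted from a token sequence and then treated as terms for TF-IDF weighting:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = ["this", "movie", "was", "surprisingly", "good"]
bigrams = ngrams(review, 2)
# Each bigram becomes one "term" for TF-IDF weighting, capturing short
# phrases such as ("surprisingly", "good") that unigrams miss.
```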
arXiv Detail & Related papers (2023-08-08T04:27:34Z)
- Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF [0.0]
The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval.
The term frequency - inverse document frequency (TF-IDF) is the most widely used term weighting method to facilitate the search for relevant documents.
The effectiveness of the method is confirmed by experiments on functions such as Cosine, Dice and Jaccard to measure the similarity of text documents for the Kazakh language.
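The three similarity functions named above have standard definitions; a minimal sketch (the sparse-vector representation and toy weights are illustrative, and this omits the paper's synonym extension):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def dice(a, b):
    """Dice coefficient between two term sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def jaccard(a, b):
    """Jaccard index between two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy TF-IDF weight vectors for two documents (values are illustrative).
d1 = {"document": 0.1, "similarity": 0.7, "search": 0.4}
d2 = {"document": 0.2, "similarity": 0.5}
```

Cosine operates on the TF-IDF weights themselves, while Dice and Jaccard compare only the sets of terms each document contains.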
arXiv Detail & Related papers (2022-11-22T15:54:41Z)
- Document-Level Relation Extraction with Sentences Importance Estimation and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
arXiv Detail & Related papers (2022-04-27T03:20:07Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- Unsupervised Identification of Relevant Prior Cases [0.0]
We propose different unsupervised approaches to solve the task of identifying relevant precedents to a given query case.
Our approaches include using word embeddings such as word2vec, doc2vec, and sent2vec; computing cosine similarity over TF-IDF vectors; retrieving relevant documents with BM25 scores; and using pre-trained models such as SBERT to find the most similar document.
Based on the comparative analysis, we found that the TF-IDF score multiplied by the BM25 score gives the best result.
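A minimal sketch of that best-performing combination, multiplying a query-level TF-IDF score by an Okapi BM25 score. The corpus and tokenization below are illustrative toy data, not the paper's legal-case collection:

```python
import math
from collections import Counter

def tfidf_score(query, doc, docs):
    """Sum of TF-IDF weights in `doc` for the query terms."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    tf = Counter(doc)
    return sum((tf[t] / len(doc)) * math.log(n / df[t])
               for t in query if t in tf)

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf[t] * (k1 + 1) / (
            tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def combined_score(query, doc, docs):
    """Rank by the product of the two scores, as the summary reports."""
    return tfidf_score(query, doc, docs) * bm25_score(query, doc, docs)

docs = [["contract", "law", "breach"],
        ["criminal", "appeal", "verdict"],
        ["contract", "dispute"]]
query = ["contract", "breach"]
ranked = sorted(range(len(docs)),
                key=lambda i: -combined_score(query, docs[i], docs))
# The document sharing both query terms ranks first.
```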
arXiv Detail & Related papers (2021-07-19T15:41:49Z)
- Unsupervised Document Embedding via Contrastive Augmentation [48.71917352110245]
We present a contrastive learning approach with data augmentation techniques to learn document representations in an unsupervised manner.
Inspired by recent contrastive self-supervised learning algorithms used for image and NLP pretraining, we hypothesize that a high-quality document embedding should be invariant to diverse paraphrases.
Our method can decrease the classification error rate by up to 6.4% over the SOTA approaches on the document classification task, matching or even surpassing fully-supervised methods.
arXiv Detail & Related papers (2021-03-26T15:48:52Z)
- Extending Neural Keyword Extraction with TF-IDF tagset matching [4.014524824655106]
Keyword extraction is the task of identifying words that best describe a given document; in news portals it serves to link articles on similar topics.
In this work we develop and evaluate our methods on four novel data sets covering less represented, morphologically-rich languages in European news media industry.
arXiv Detail & Related papers (2021-01-31T15:39:17Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.