Unsupervised extraction of local and global keywords from a single text
- URL: http://arxiv.org/abs/2307.14005v2
- Date: Fri, 14 Jun 2024 06:40:31 GMT
- Title: Unsupervised extraction of local and global keywords from a single text
- Authors: Lida Aleksanyan, Armen E. Allahverdyan
- Abstract summary: We propose an unsupervised, corpus-independent method to extract keywords from a single text.
It is based on the spatial distribution of words and the response of this distribution to a random permutation of words.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an unsupervised, corpus-independent method to extract keywords from a single text. It is based on the spatial distribution of words and the response of this distribution to a random permutation of words. Compared to existing methods (such as YAKE), our method has three advantages. First, it is significantly more effective at extracting keywords from long texts. Second, it allows inference of two types of keywords: local and global. Third, it uncovers basic themes in texts. Additionally, our method is language-independent and applies to short texts. The results are obtained via human annotators with prior knowledge of texts from our database of classical literary works (the agreement between annotators ranges from moderate to substantial). Our results are supported by human-independent arguments based on the average length of extracted content words and on the average number of nouns among extracted words. We discuss relations of keywords with higher-order textual features and reveal a connection between keywords and chapter divisions.
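The abstract describes the method only at a high level: look at how a word's occurrences are distributed along the text, then see how that distribution responds to a random permutation of the words. The sketch below illustrates that general idea and is not the authors' algorithm. The clustering statistic (coefficient of variation of the gaps between consecutive occurrences), the function names, and the parameters `min_count`, `n_shuffles`, and `seed` are all assumptions made for illustration.

```python
import random
import statistics


def word_positions(tokens):
    """Map each token to the sorted list of indices where it occurs."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    return positions


def gap_variation(positions):
    """Coefficient of variation of gaps between consecutive occurrences.

    A value near 0 means evenly spaced occurrences; large values mean
    bursts of occurrences separated by long stretches without the word.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)


def clustering_scores(tokens, min_count=5, n_shuffles=20, seed=0):
    """Score = gap variation in the text minus its average after shuffling.

    Assumed heuristic: words whose clustering disappears under a random
    permutation (high positive score) are keyword candidates.
    """
    rng = random.Random(seed)
    original = {
        w: gap_variation(p)
        for w, p in word_positions(tokens).items()
        if len(p) >= min_count
    }
    baseline = {w: 0.0 for w in original}
    shuffled = list(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # random permutation of the whole text
        shuffled_positions = word_positions(shuffled)
        for w in original:
            baseline[w] += gap_variation(shuffled_positions[w]) / n_shuffles
    return {w: original[w] - baseline[w] for w in original}
```

Calling `clustering_scores(text.lower().split())` on a long text and sorting by score would rank bursty words first, while words spread evenly through the text (typically function words) score near or below zero. How the paper separates local from global keywords is not stated in the abstract, so the sketch does not attempt that distinction.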
Related papers
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth value of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- Applying Transformer-based Text Summarization for Keyphrase Generation [2.28438857884398]
Keyphrases are crucial for searching and systematizing scholarly documents.
In this paper, we experiment with popular transformer-based models for abstractive text summarization.
We show that summarization models are quite effective in generating keyphrases in terms of the full-match F1-score and BERTScore.
We also investigate several ordering strategies to target keyphrases.
arXiv Detail & Related papers (2022-09-08T13:01:52Z)
- Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents [19.035917264711664]
We propose a training strategy for text semantic matching by disentangling keywords from intents.
Our approach can be easily combined with pre-trained language models (PLM) without influencing their inference efficiency.
arXiv Detail & Related papers (2022-03-06T07:48:24Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
- Importance Estimation from Multiple Perspectives for Keyphrase Extraction [34.51718374923614]
We propose a new approach to estimate the importance of keyphrases from multiple perspectives (called KIEMP).
KIEMP estimates the importance of a phrase with three modules: a chunking module to measure its syntactic accuracy, a ranking module to check its information saliency, and a matching module to judge the concept consistency between the phrase and the whole document.
Experimental results on six benchmark datasets show that KIEMP outperforms the existing state-of-the-art keyphrase extraction approaches in most cases.
arXiv Detail & Related papers (2021-10-19T05:48:22Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- FRAKE: Fusional Real-time Automatic Keyword Extraction [1.332091725929965]
Keyword extraction is the task of identifying the words or phrases that best express the main concepts of a text.
We use a combined approach that consists of two models: graph centrality features and textual features.
arXiv Detail & Related papers (2021-04-10T18:30:17Z)
- Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching [66.71886789848472]
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem.
The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information.
Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
arXiv Detail & Related papers (2021-01-16T10:34:03Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Keywords lie far from the mean of all words in local vector space [5.040463208115642]
In this work, we follow a different path to detect the keywords from a text document by modeling the main distribution of the document's words using local word vector representations.
We confirm the high performance of our approach compared to strong baselines and state-of-the-art unsupervised keyword extraction methods.
arXiv Detail & Related papers (2020-08-21T14:42:33Z)