GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction
- URL: http://arxiv.org/abs/2005.09740v1
- Date: Tue, 19 May 2020 20:24:02 GMT
- Title: GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction
- Authors: Javad Rafiei Asl, Juan M. Banda
- Abstract summary: We introduce Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of automatic keyphrase extraction.
GLEAKE uses single and multi-word embedding techniques to explore the syntactic and semantic aspects of the candidate phrases.
It refines the most significant phrases as a final set of keyphrases.
- Score: 1.0681288493631977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated methods for granular categorization of large corpora of text
documents have become increasingly important as the volume of scientific,
news, medical, and web documents has grown over the last few years. Automatic
keyphrase extraction (AKE) aims to automatically detect a small set of
single- or multi-word phrases within a single textual document that capture the
main topics of the document. AKE plays an important role in various NLP and
information retrieval tasks such as document summarization and categorization,
full-text indexing, and article recommendation. Due to the lack of sufficient
human-labeled data in different textual contents, supervised learning
approaches are not ideal for automatic detection of keyphrases from the content
of textual bodies. With the state-of-the-art advances in text embedding
techniques, NLP researchers have focused on developing unsupervised methods to
obtain meaningful insights from raw datasets. In this work, we introduce Global
and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of AKE.
GLEAKE utilizes single and multi-word embedding techniques to explore the
syntactic and semantic aspects of the candidate phrases and then combines them
into a series of embedding-based graphs. Moreover, GLEAKE applies network
analysis techniques on each embedding-based graph to refine the most
significant phrases as a final set of keyphrases. We demonstrate the high
performance of GLEAKE by evaluating its results on five standard AKE datasets
from different domains and writing styles and by showing its superiority over
other state-of-the-art methods.
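The general idea in the abstract (embed candidate phrases, connect them in an embedding-based graph, then rank nodes with a network-analysis measure) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not say which embedding models or graph measures GLEAKE uses, so the hand-made toy vectors, the cosine-similarity edge weights, the PageRank-style ranking, and the names `cosine`, `rank_phrases`, and `toy_embeddings` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_phrases(embeddings, damping=0.85, iterations=50):
    """Rank phrases with weighted PageRank over a cosine-similarity graph.

    Assumption: PageRank stands in for whatever network-analysis
    technique the paper actually applies.
    """
    phrases = list(embeddings)
    n = len(phrases)

    # Edge weight = non-negative cosine similarity; no self-loops.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine(embeddings[phrases[i]], embeddings[phrases[j]])
                weights[i][j] = max(0.0, sim)

    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n

    # Power iteration: each node passes its score to its neighbours
    # in proportion to edge weight.
    for _ in range(iterations):
        updated = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated

    return sorted(zip(phrases, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings, invented purely for this example.
toy_embeddings = {
    "keyphrase extraction": [0.9, 0.1, 0.0],
    "text embedding": [0.8, 0.3, 0.1],
    "graph ranking": [0.7, 0.2, 0.3],
    "last few years": [0.0, 0.1, 0.9],
}

ranked = rank_phrases(toy_embeddings)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

In this toy setup the off-topic phrase is only weakly connected to the others, so it ends up at the bottom of the ranking; a real system would replace the toy vectors with learned single- and multi-word embeddings.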
Related papers
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches [0.0]
Information extraction plays an important role in natural language processing.
Document genre and length influence IE tasks.
No single method demonstrated overwhelming performance in both tasks.
arXiv Detail & Related papers (2023-06-30T20:43:27Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Improving Keyphrase Extraction with Data Augmentation and Information Filtering [67.43025048639333]
Keyphrase extraction is one of the essential tasks for document understanding in NLP.
We present a novel corpus and method for keyphrase extraction from the videos streamed on the Behance platform.
arXiv Detail & Related papers (2022-09-11T22:38:02Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of benchmark datasets for this task are from the scientific domain and contain only the document title and abstract.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
arXiv Detail & Related papers (2022-03-29T08:44:57Z)
- Unsupervised Keyphrase Extraction via Interpretable Neural Networks [27.774524511005172]
Keyphrases that are most useful for predicting the topic of a text are important keyphrases.
INSPECT is a self-explaining neural framework for identifying influential keyphrases.
We show that INSPECT achieves state-of-the-art results in unsupervised keyphrase extraction across four diverse datasets.
arXiv Detail & Related papers (2022-03-15T04:30:47Z)
- Multi-Document Keyphrase Extraction: A Literature Review and the First Dataset [24.91326715164367]
Multi-document keyphrase extraction has been infrequently studied, despite its utility for describing sets of documents.
We present here the first literature review and the first dataset for the task, MK-DUC-01, which can serve as a new benchmark.
arXiv Detail & Related papers (2021-10-03T19:10:28Z)
- PerKey: A Persian News Corpus for Keyphrase Extraction and Generation [1.192436948211501]
PerKey is a corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality author-extracted keyphrases.
The data was put into human assessment to ensure the quality of the keyphrases.
arXiv Detail & Related papers (2020-09-25T14:36:41Z)
- BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation [17.003488045214972]
Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available.
In developing a methodology to handle single documents, we face two major challenges.
First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms.
Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments.
arXiv Detail & Related papers (2020-08-05T16:34:33Z)
- TRIE: End-to-End Text Reading and Information Extraction for Document Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
Multimodal visual and textual features of text reading are fused for information extraction.
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
arXiv Detail & Related papers (2020-05-27T01:47:26Z)