PerKey: A Persian News Corpus for Keyphrase Extraction and Generation
- URL: http://arxiv.org/abs/2009.12269v1
- Date: Fri, 25 Sep 2020 14:36:41 GMT
- Title: PerKey: A Persian News Corpus for Keyphrase Extraction and Generation
- Authors: Ehsan Doostmohammadi, Mohammad Hadi Bokaei, Hossein Sameti
- Abstract summary: PerKey is a corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality, author-extracted keyphrases.
The data then underwent human assessment to ensure the quality of the keyphrases.
- Score: 1.192436948211501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Keyphrases provide an extremely dense summary of a text. Such information can
be used in many Natural Language Processing tasks, such as information
retrieval and text summarization. Since previous studies on Persian keyword or
keyphrase extraction have not published their data, the field suffers from the
lack of a human-extracted keyphrase dataset. In this paper, we introduce
PerKey, a corpus of 553k news articles from six Persian news websites and
agencies with relatively high-quality author-extracted keyphrases, which is
then filtered and cleaned to achieve higher-quality keyphrases. The resulting
data underwent human assessment to ensure the quality of the keyphrases. We
also measured the performance of different supervised and unsupervised
techniques, e.g. TF-IDF, MultipartiteRank, and KEA, on the dataset using
precision, recall, and F1-score.
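The evaluation named in the abstract (precision, recall, and F1-score over predicted keyphrases) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes exact matching after lowercasing and whitespace normalization, and the function names `normalize` and `keyphrase_prf1` are hypothetical. The abstract does not state the paper's actual matching rules (for example stemming or a top-k cutoff).

```python
from typing import Iterable, Tuple


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(phrase.lower().split())


def keyphrase_prf1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two keyphrase sets."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    matched = len(pred & ref)

    # Guard against empty sets to avoid division by zero.
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with made-up keyphrases (not taken from PerKey).
predicted = ["Keyphrase extraction", "news corpus", "deep learning"]
gold = ["keyphrase extraction", "Persian", "news corpus", "keyphrase generation"]
p, r, f = keyphrase_prf1(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In the toy example, two of the three predictions match two of the four gold keyphrases, so precision is 2/3, recall is 1/2, and F1 is 4/7.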
Related papers
- BibRank: Automatic Keyphrase Extraction Platform Using Metadata [0.0]
This paper introduces a platform that integrates keyphrase datasets and facilitates the evaluation of keyphrase extraction algorithms.
The platform includes BibRank, an automatic keyphrase extraction algorithm that leverages a rich dataset obtained by parsing bibliographic data in BibTeX format.
arXiv Detail & Related papers (2023-10-13T14:44:34Z)
- Data Augmentation for Low-Resource Keyphrase Generation [46.52115499306222]
Keyphrase generation is the task of summarizing the contents of any given article into a few salient phrases (or keyphrases).
Existing works for the task mostly rely on large-scale annotated datasets, which are not easy to acquire.
We present data augmentation strategies specifically to address keyphrase generation in purely resource-constrained domains.
arXiv Detail & Related papers (2023-05-29T09:20:34Z)
- Neural Keyphrase Generation: Analysis and Evaluation [47.004575377472285]
We study various tendencies exhibited by three strong models: T5 (based on a pre-trained transformer), CatSeq-Transformer (a non-pretrained Transformer), and ExHiRD (based on a recurrent neural network).
We propose a novel metric framework, SoftKeyScore, to evaluate the similarity between two sets of keyphrases.
arXiv Detail & Related papers (2023-04-27T00:10:21Z)
- Improving Keyphrase Extraction with Data Augmentation and Information Filtering [67.43025048639333]
Keyphrase extraction is one of the essential tasks for document understanding in NLP.
We present a novel corpus and method for keyphrase extraction from the videos streamed on the Behance platform.
arXiv Detail & Related papers (2022-09-11T22:38:02Z)
- Applying Transformer-based Text Summarization for Keyphrase Generation [2.28438857884398]
Keyphrases are crucial for searching and systematizing scholarly documents.
In this paper, we experiment with popular transformer-based models for abstractive text summarization.
We show that summarization models are quite effective in generating keyphrases in terms of the full-match F1-score and BERTScore.
We also investigate several ordering strategies to target keyphrases.
arXiv Detail & Related papers (2022-09-08T13:01:52Z)
- Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training [66.64843711515341]
Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text.
We call attention to a new setting named multilingual keyphrase generation.
We propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data shortage problem in non-English languages.
arXiv Detail & Related papers (2022-05-21T00:45:21Z)
- LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of the benchmark datasets for this task are from the scientific domain and contain only the document title and abstract.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
arXiv Detail & Related papers (2022-03-29T08:44:57Z)
- Deep Keyphrase Completion [59.0413813332449]
Keyphrases provide accurate information about document content that is highly compact, concise, and full of meaning, and they are widely used for discourse comprehension, organization, and text retrieval.
We propose keyphrase completion (KPC) to generate more keyphrases for a document (e.g. a scientific publication), taking advantage of the document content along with a very limited number of known keyphrases.
We name it deep keyphrase completion (DKPC) since it attempts to capture the deep semantic meaning of the document content together with known keyphrases via a deep learning framework.
arXiv Detail & Related papers (2021-10-29T07:15:35Z)
- Persian Keyphrase Generation Using Sequence-to-Sequence Models [1.192436948211501]
Keyphrases are a summary of an input text and provide the main subjects discussed in the text.
In this paper, we try to tackle the problem of keyphrase generation and extraction from news articles using deep sequence-to-sequence models.
arXiv Detail & Related papers (2020-09-25T14:40:14Z)
- GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction [1.0681288493631977]
We introduce Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of automatic keyphrase extraction.
GLEAKE uses single and multi-word embedding techniques to explore the syntactic and semantic aspects of the candidate phrases.
It refines the most significant phrases as a final set of keyphrases.
arXiv Detail & Related papers (2020-05-19T20:24:02Z)