LDKP: A Dataset for Identifying Keyphrases from Long Scientific
Documents
- URL: http://arxiv.org/abs/2203.15349v1
- Date: Tue, 29 Mar 2022 08:44:57 GMT
- Title: LDKP: A Dataset for Identifying Keyphrases from Long Scientific
Documents
- Authors: Debanjan Mahata, Naveen Agarwal, Dibya Gautam, Amardeep Kumar, Swapnil
Parekh, Yaman Kumar Singla, Anish Acharya, Rajiv Ratn Shah
- Abstract summary: Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of the benchmark datasets for this task are from the scientific domain and contain only the document title and abstract.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are directly found beyond the limited context of title and abstract.
- Score: 48.84086818702328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying keyphrases (KPs) from text documents is a fundamental task in
natural language processing and information retrieval. The vast majority of the
benchmark datasets for this task are from the scientific domain and contain only
the document title and abstract. This limits keyphrase extraction
(KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from
human-written summaries that are often very short (approx. 8 sentences). This
presents three challenges for real-world applications: human-written summaries
are unavailable for most documents, the documents are almost always long, and a
high percentage of KPs are directly found beyond the limited context of title
and abstract. Therefore, we release two extensive corpora mapping KPs of ~1.3M
and ~100K scientific articles with their fully extracted text and additional
metadata including publication venue, year, author, field of study, and
citations for facilitating research on this real-world problem.
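The abstract's claim that many KPs occur only beyond the title and abstract can be checked per document with a simple coverage count. The following is a minimal sketch, not the authors' procedure: the function name `split_keyphrases`, the toy document, and the use of case-insensitive verbatim substring matching (no stemming or tokenization) are all assumptions made for illustration.

```python
def split_keyphrases(keyphrases, title_abstract, full_text):
    """Partition keyphrases by where they appear verbatim (case-insensitive).

    Hypothetical helper: uses naive substring matching, with no stemming.
    """
    short = title_abstract.lower()
    long_text = full_text.lower()
    in_title_abstract, only_in_full_text, absent = [], [], []
    for kp in keyphrases:
        needle = kp.lower()
        if needle in short:
            in_title_abstract.append(kp)
        elif needle in long_text:
            only_in_full_text.append(kp)
        else:
            absent.append(kp)
    return in_title_abstract, only_in_full_text, absent


# Toy, made-up document used only to exercise the helper.
title_abstract = "A dataset for keyphrase extraction from long documents."
full_text = title_abstract + " We fine-tune a Longformer encoder and report F1 scores."
keyphrases = ["keyphrase extraction", "Longformer", "graph ranking"]

in_ta, only_full, absent = split_keyphrases(keyphrases, title_abstract, full_text)
print("title/abstract:", in_ta)
print("full text only:", only_full)
print("absent:", absent)
```

Averaging the size of the "full text only" group over a corpus would give a rough estimate of how many keyphrases a title-and-abstract-only dataset cannot expose to an extractive model.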
Related papers
- Improving Keyphrase Extraction with Data Augmentation and Information
Filtering [67.43025048639333]
Keyphrase extraction is one of the essential tasks for document understanding in NLP.
We present a novel corpus and method for keyphrase extraction from the videos streamed on the Behance platform.
arXiv Detail & Related papers (2022-09-11T22:38:02Z)
- Keyphrase Generation Beyond the Boundaries of Title and Abstract [28.56508031460787]
Keyphrase generation aims at generating phrases (keyphrases) that best describe a given document.
In this work, we explore whether the integration of additional data from semantically similar articles or from the full text of the given article can be helpful for a neural keyphrase generation model.
We find that adding sentences from the full text, particularly in the form of a summary of the article, can significantly improve the generation of both types of keyphrases.
arXiv Detail & Related papers (2021-12-13T16:33:01Z)
- Deep Keyphrase Completion [59.0413813332449]
Keyphrases provide accurate information about document content that is compact, concise, and meaningful, and they are widely used for discourse comprehension, organization, and text retrieval.
We propose keyphrase completion (KPC) to generate more keyphrases for a document (e.g. a scientific publication), taking advantage of the document content along with a very limited number of known keyphrases.
We name it deep keyphrase completion (DKPC) since it attempts to capture the deep semantic meaning of the document content together with known keyphrases via a deep learning framework.
arXiv Detail & Related papers (2021-10-29T07:15:35Z)
- Multi-Document Keyphrase Extraction: A Literature Review and the First
Dataset [24.91326715164367]
Multi-document keyphrase extraction has been infrequently studied, despite its utility for describing sets of documents.
We present here the first literature review and the first dataset for the task, MK-DUC-01, which can serve as a new benchmark.
arXiv Detail & Related papers (2021-10-03T19:10:28Z)
- One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase
Generation [78.10924968931249]
ParaSCI is the first large-scale paraphrase dataset in the scientific field.
This dataset includes 33,981 paraphrase pairs from ACL (ParaSCI-ACL) and 316,063 pairs from arXiv (ParaSCI-arXiv).
arXiv Detail & Related papers (2021-01-21T01:10:06Z)
- PerKey: A Persian News Corpus for Keyphrase Extraction and Generation [1.192436948211501]
PerKey is a corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality author-extracted keyphrases.
The data underwent human assessment to ensure the quality of the keyphrases.
arXiv Detail & Related papers (2020-09-25T14:36:41Z)
- GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction [1.0681288493631977]
We introduce Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of automatic keyphrase extraction.
GLEAKE uses single and multi-word embedding techniques to explore the syntactic and semantic aspects of the candidate phrases.
It refines the most significant phrases as a final set of keyphrases.
arXiv Detail & Related papers (2020-05-19T20:24:02Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with
Manifold Information [77.89755281215079]
Text summarization is the research area aiming at creating a short and condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.