Two Huge Title and Keyword Generation Corpora of Research Articles
- URL: http://arxiv.org/abs/2002.04689v1
- Date: Tue, 11 Feb 2020 21:17:29 GMT
- Title: Two Huge Title and Keyword Generation Corpora of Research Articles
- Authors: Erion Çano, Ondřej Bojar
- Abstract summary: We introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research.
The data were retrieved from the Open Academic Graph, which is a network of research profiles and publications.
We would like to apply topic modeling to the two sets to derive subsets of research articles from more specific disciplines.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in sequence-to-sequence learning with neural networks
have considerably improved the quality of automatically generated text
summaries and document keywords, increasing the need for even bigger training
corpora. Metadata of research articles are usually easy to find online and can
be used for research on various tasks. In this paper, we introduce two
huge datasets for text summarization (OAGSX) and keyword generation (OAGKX)
research, containing 34 million and 23 million records, respectively. The data
were retrieved from the Open Academic Graph which is a network of research
profiles and publications. We carefully processed each record and also tried
several extractive and abstractive methods for both tasks to create performance
baselines for other researchers. We further illustrate the performance of those
methods by previewing their outputs. In the near future, we would like to apply
topic modeling to the two sets to derive subsets of research articles from more
specific disciplines.
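The abstract does not include loader or baseline code, so the following is only a rough sketch of how such records might be consumed. It assumes JSON Lines storage and hypothetical field names ("title", "abstract", "keywords"); the actual OAGSX/OAGKX schema may differ, and the frequency-based keyword baseline is a generic stand-in, not the authors' method.

    # Rough sketch: read OAGKX-style JSON Lines records and score a naive
    # extractive keyword baseline. Field names and the file name are
    # assumptions, not the documented schema.
    import json
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on",
                 "with", "is", "are", "we", "this", "that"}

    def load_records(path):
        """Yield one dict per line from a JSON Lines file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    def frequent_unigrams(text, k=5):
        """Naive extractive baseline: k most frequent non-stopword unigrams."""
        tokens = [t.strip(".,;:()").lower() for t in text.split()]
        tokens = [t for t in tokens if t and t not in STOPWORDS]
        return [word for word, _ in Counter(tokens).most_common(k)]

    def keyword_f1(predicted, gold):
        """Exact-match F1 between predicted and gold keyword sets."""
        pred, ref = set(predicted), {k.lower() for k in gold}
        true_positives = len(pred & ref)
        if true_positives == 0:
            return 0.0
        precision, recall = true_positives / len(pred), true_positives / len(ref)
        return 2 * precision * recall / (precision + recall)

    scores = [keyword_f1(frequent_unigrams(r["abstract"]), r["keywords"])
              for r in load_records("oagkx_sample.jsonl")]  # hypothetical file
    print(f"mean keyword F1: {sum(scores) / max(len(scores), 1):.3f}")

Swapping frequent_unigrams for a stronger extractive ranker or a trained sequence-to-sequence generator gives the kind of extractive and abstractive baselines the abstract refers to.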
Related papers
- Capturing research literature attitude towards Sustainable Development Goals: an LLM-based topic modeling approach [0.7806050661713976]
The Sustainable Development Goals were formulated by the United Nations in 2015 to address a set of global challenges by 2030.
Natural language processing techniques can help uncover discussions on SDGs within research literature.
We propose a completely automated pipeline to fetch content from the Scopus database and prepare datasets dedicated to five groups of SDGs.
arXiv Detail & Related papers (2024-11-05T09:37:23Z)
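Both the paper above and this SDG entry rely on topic modeling to split a large set of abstracts into thematic subsets. The following is a generic sketch (LDA via scikit-learn) of subsetting articles by their dominant topic; it is not the pipeline of either paper.

    # Generic sketch: cluster abstracts into topics with LDA and group articles
    # by their dominant topic. Illustrative only; neither paper's actual code.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    abstracts = [
        "We study neural machine translation for low-resource languages.",
        "A randomized trial of a new vaccine against seasonal influenza.",
        "Graph neural networks for molecular property prediction.",
    ]  # in practice, abstracts read from the OAGSX/OAGKX records

    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectorizer.fit_transform(abstracts)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # rows: documents, columns: topic weights

    # Assign each article to its dominant topic to form candidate subsets.
    dominant = doc_topics.argmax(axis=1)
    for topic_id in range(lda.n_components):
        subset = [a for a, t in zip(abstracts, dominant) if t == topic_id]
        print(f"topic {topic_id}: {len(subset)} article(s)")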
- Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
- Synthesizing Scientific Summaries: An Extractive and Abstractive Approach [0.5904095466127044]
We propose a hybrid methodology for research paper summarisation.
We use two models based on unsupervised learning for the extraction stage and two transformer language models for the abstractive stage.
We find that, with certain combinations of hyperparameters, automated summarisation systems can exceed the abstractiveness of summaries written by humans.
arXiv Detail & Related papers (2024-07-29T08:21:42Z)
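The entry above pairs an extraction stage with transformer-based abstraction. Below is a minimal extract-then-abstract sketch; the frequency-based sentence scorer and the distilbart checkpoint are illustrative choices, not the models used in the paper.

    # Minimal sketch of a two-stage (extract, then abstract) summariser.
    from collections import Counter
    from transformers import pipeline

    def extract_sentences(text, k=3):
        """Score sentences by average word frequency and keep the top k."""
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        freqs = Counter(word.lower() for word in text.split())
        def score(sentence):
            words = sentence.split()
            return sum(freqs[w.lower()] for w in words) / max(len(words), 1)
        top = set(sorted(sentences, key=score, reverse=True)[:k])
        # Keep the selected sentences in their original order.
        return ". ".join(s for s in sentences if s in top) + "."

    document = (
        "We present a new corpus of research articles. "
        "The corpus contains millions of abstracts with author keywords. "
        "We evaluate extractive and abstractive baselines on the corpus. "
        "Results show that larger training data improves summary quality."
    )

    abstractive = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    extract = extract_sentences(document, k=2)
    summary = abstractive(extract, max_length=60, min_length=10)[0]["summary_text"]
    print(summary)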
- Named Entity Recognition Based Automatic Generation of Research Highlights [3.9410617513331863]
We aim to automatically generate research highlights using different sections of a research paper as input.
We investigate whether the use of named entity recognition on the input improves the quality of the generated highlights.
arXiv Detail & Related papers (2023-02-25T16:33:03Z)
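A simple way to expose named entities to a highlight generator, in the spirit of the entry above, is to tag them inline before encoding. The sketch below uses spaCy's small English model and a made-up tagging scheme; it is not the paper's actual pipeline, and it requires the en_core_web_sm model to be downloaded first.

    # Illustrative sketch: wrap recognised entities in inline tags so a
    # seq2seq highlight generator can attend to them. Tagging scheme is made up.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English NER model

    def annotate_entities(section_text):
        doc = nlp(section_text)
        annotated = section_text
        # Replace from the end so earlier character offsets stay valid.
        for ent in reversed(doc.ents):
            tagged = f"<{ent.label_}> {ent.text} </{ent.label_}>"
            annotated = annotated[:ent.start_char] + tagged + annotated[ent.end_char:]
        return annotated

    abstract = ("We evaluate BERT and GPT-2 on the CNN/DailyMail dataset "
                "and report results on 10,000 articles from 2019.")
    print(annotate_entities(abstract))  # entity-tagged input for the generator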
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
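The entry above augments information extraction with signals from the graph of citations between papers. The sketch below builds such a graph and derives simple node features; the features shown are illustrative stand-ins, not those used in the paper.

    # Illustrative sketch: a directed citation graph plus simple node features.
    import networkx as nx

    # Hypothetical (citing paper, cited paper) edges, e.g. parsed from metadata.
    edges = [("paperA", "paperB"), ("paperA", "paperC"), ("paperD", "paperB")]
    graph = nx.DiGraph(edges)

    # Example per-paper features: incoming citation count and PageRank.
    citation_counts = dict(graph.in_degree())
    pagerank = nx.pagerank(graph)

    for node in graph.nodes:
        print(node, citation_counts[node], round(pagerank[node], 3))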
- Sequential Sentence Classification in Research Papers using Cross-Domain Multi-Task Learning [4.2443814047515716]
We propose a uniform deep learning architecture and multi-task learning to improve sequential sentence classification in scientific texts across domains.
Our approach outperforms the state of the art on three benchmark datasets.
arXiv Detail & Related papers (2021-02-11T13:54:10Z)
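Cross-domain multi-task learning for sentence classification usually shares an encoder while keeping one output head per domain-specific label scheme. The sketch below shows that basic shape with placeholder dimensions and label sets; it is not the architecture proposed in the paper.

    # Placeholder sketch: shared encoder, per-domain classification heads.
    import torch
    import torch.nn as nn

    class MultiTaskSentenceClassifier(nn.Module):
        def __init__(self, input_dim=768, hidden_dim=256, labels_per_domain=None):
            super().__init__()
            labels_per_domain = labels_per_domain or {"biomed": 5, "cs": 7}
            # Shared layers learn representations common to all domains.
            self.shared = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
            # One output head per domain-specific label scheme.
            self.heads = nn.ModuleDict(
                {domain: nn.Linear(hidden_dim, n) for domain, n in labels_per_domain.items()}
            )

        def forward(self, sentence_embeddings, domain):
            return self.heads[domain](self.shared(sentence_embeddings))

    model = MultiTaskSentenceClassifier()
    embeddings = torch.randn(4, 768)         # e.g. pooled transformer outputs
    logits = model(embeddings, domain="cs")  # per-sentence label scores
    print(logits.shape)                      # torch.Size([4, 7])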
- What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
- Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles [3.0504782036247438]
We propose a topic-centric unsupervised multi-document summarization framework to generate abstractive summaries.
The proposed algorithm generates an abstractive summary by developing salient language unit selection and text generation techniques.
Our approach matches the state-of-the-art when evaluated on automated extractive evaluation metrics and performs better for abstractive summarization on five human evaluation metrics.
arXiv Detail & Related papers (2020-11-03T04:04:21Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG, while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)
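ROUGE-L, cited in the KGPT entry above and a standard metric for the summarization baselines discussed on this page, scores the longest common subsequence between a candidate and a reference. A simplified single-reference sketch:

    # Simplified ROUGE-L (F1 over the longest common subsequence of tokens).
    def lcs_length(a, b):
        """Dynamic-programming length of the longest common subsequence."""
        table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                table[i][j] = (table[i - 1][j - 1] + 1 if x == y
                               else max(table[i - 1][j], table[i][j - 1]))
        return table[len(a)][len(b)]

    def rouge_l(candidate, reference):
        cand, ref = candidate.lower().split(), reference.lower().split()
        lcs = lcs_length(cand, ref)
        if lcs == 0:
            return 0.0
        precision, recall = lcs / len(cand), lcs / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(rouge_l("the model generates fluent text",
                  "the model produces fluent text"))  # 0.8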
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area aiming at creating a short and condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper surveys these new summarization tasks and approaches as they appear in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.