SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation
- URL: http://arxiv.org/abs/2110.10774v1
- Date: Wed, 20 Oct 2021 20:37:11 GMT
- Title: SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation
- Authors: Hong Chen, Hiroya Takamura, Hideki Nakayama
- Abstract summary: We propose a new task, namely context-aware text generation in the scientific domain.
We present SciXGen, a novel large-scale Scientific Paper dataset for ConteXt-Aware Text Generation.
We comprehensively benchmark, using state-of-the-art models, the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs.
- Score: 27.064042116555925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating texts in scientific papers requires not only capturing the content contained within the given input but also frequently acquiring external information, called context. We push scientific text generation forward by proposing a new task, namely context-aware text generation in the scientific domain, which aims to exploit the contributions of context to the generated texts. To this end, we present a novel, challenging, large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely used objects (e.g., tables, figures, algorithms) in a paper. We comprehensively benchmark, using state-of-the-art models, the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly available to facilitate research on scientific text generation.
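As a rough illustration of the task setup, the sketch below assembles a context-aware model input from a hypothetical SciXGen-style record, prepending context objects (a referenced table caption, a cited-paper snippet) to the local input before feeding a seq2seq model. The record schema, field names, and separator markers are assumptions made for illustration, not the dataset's actual format.

```python
# Minimal sketch of context-aware input construction for a seq2seq model.
# The record layout below is hypothetical; SciXGen's real format may differ.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

record = {
    "preceding_text": "We evaluate our model on three benchmarks.",
    "context": [
        "Table 2: BLEU scores of all baselines on the test set.",  # referenced table caption
        "[CITE] proposed a transformer-based generation model.",   # cited-paper snippet
    ],
    "target": "As shown in Table 2, our model outperforms all baselines.",
}

# Prepend the external context to the local input with plain-text markers.
source = " <ctx> ".join(record["context"]) + " <sep> " + record["preceding_text"]

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

inputs = tokenizer(source, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```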
Related papers
- Augmenting Textual Generation via Topology Aware Retrieval [30.933176170660683]
We develop a Topology-aware Retrieval-augmented Generation framework.
This framework includes a retrieval module that selects texts based on their topological relationships.
We have curated established text-attributed networks and conducted comprehensive experiments to validate the effectiveness of this framework; a toy sketch of topology-aware selection follows this entry.
arXiv Detail & Related papers (2024-05-27T19:02:18Z)
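To make retrieval over topological relationships concrete, here is a toy sketch that ranks candidate passages by graph distance from a query node in a text-attributed network. The graph, scoring rule, and passages are invented for illustration and are not the paper's actual method.

```python
# Toy sketch of topology-aware retrieval: rank candidate passages by
# graph proximity to the query node in a text-attributed network.
import networkx as nx

# Text-attributed network: nodes carry passages, edges carry relations
# (e.g., citations or hyperlinks). All contents here are invented.
G = nx.Graph()
G.add_node("q", text="What datasets exist for citation text generation?")
G.add_node("a", text="CiteBench aggregates citation generation datasets.")
G.add_node("b", text="SCROLLS targets long-sequence reasoning tasks.")
G.add_edges_from([("q", "a"), ("a", "b")])

# Score candidates by shortest-path distance from the query node,
# preferring topologically closer passages.
candidates = [n for n in G.nodes if n != "q"]
ranked = sorted(candidates, key=lambda n: nx.shortest_path_length(G, "q", n))
retrieved = [G.nodes[n]["text"] for n in ranked[:1]]
print(retrieved)
```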
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- Event Transition Planning for Open-ended Text Generation [55.729259805477376]
Open-ended text generation tasks require models to generate a coherent continuation given limited preceding context.
We propose a novel two-stage method which explicitly arranges the ensuing events in open-ended text generation.
Our approach can be understood as a specially-trained coarse-to-fine algorithm.
arXiv Detail & Related papers (2022-04-20T13:37:51Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets of artificially generated research content. In the first, the content is generated entirely by the GPT-2 model from a short prompt extracted from original papers. The second, partial or hybrid dataset is created by replacing several sentences of each abstract with sentences generated by the Arxiv-NLP model. The quality of the datasets is evaluated by comparing the generated texts to the aligned original texts with overlap metrics such as BLEU and ROUGE; a generation-and-scoring sketch follows this entry.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
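A minimal sketch of how such fully machine-generated content can be produced and scored, using GPT-2 through the Hugging Face pipeline API and sentence-level BLEU via sacrebleu. The prompt, decoding settings, and reference text are illustrative, not the corpus's actual generation recipe.

```python
# Generate a machine-written continuation from a short prompt (GPT-2),
# then score it against an aligned original text with sentence-level BLEU.
import sacrebleu
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "In this paper, we investigate the impact of pretraining data on"
generated = generator(prompt, max_new_tokens=60, do_sample=True, top_p=0.9)
generated_text = generated[0]["generated_text"]

# Aligned original text (invented here for illustration).
original_text = ("In this paper, we investigate the impact of pretraining "
                 "data on downstream task accuracy.")
score = sacrebleu.sentence_bleu(generated_text, [original_text])
print(generated_text)
print(f"BLEU: {score.score:.2f}")
```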
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods; a loading sketch follows this entry.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
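A sketch of consuming such a unified text-to-text benchmark with the Hugging Face datasets library. The "tau/scrolls" hub name, the "gov_report" config, and the "input"/"output" field names are assumptions about the release rather than confirmed details.

```python
# Load one SCROLLS task and inspect its text-to-text fields.
# Hub name, config, and field names are assumptions; check the release.
from datasets import load_dataset

scrolls = load_dataset("tau/scrolls", "gov_report", split="validation")
example = scrolls[0]
print(example["input"][:200])   # long source document
print(example["output"][:200])  # target text (here, a summary)
```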
- Topic Modeling Based Extractive Text Summarization [0.0]
We propose a novel method to summarize a text document by clustering its contents based on latent topics.
We utilize the lesser-used and challenging WikiHow dataset in our approach to text summarization; a toy clustering sketch follows this entry.
arXiv Detail & Related papers (2021-06-29T12:28:19Z)
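As a toy sketch of topic-based extractive summarization: fit LDA over a document's sentences, then keep the most representative sentence per latent topic. The vectorizer, topic count, and sentences are illustrative choices, not the paper's exact configuration.

```python
# Toy topic-based extractive summarization: fit LDA over the document's
# sentences, then pick the highest-weighted sentence per latent topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "Transformers dominate modern NLP benchmarks.",
    "Attention layers capture long-range dependencies.",
    "WikiHow articles give step-by-step instructions.",
    "Each step describes one concrete action to take.",
]

X = CountVectorizer(stop_words="english").fit_transform(sentences)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # sentence-by-topic weight matrix

# Summary: the most representative sentence for each latent topic.
summary = [sentences[doc_topic[:, k].argmax()] for k in range(2)]
print(" ".join(summary))
```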
- A Survey of Knowledge-Enhanced Text Generation [81.24633231919137]
The goal of text generation is to make machines express themselves in human language.
Various neural encoder-decoder models have been proposed to achieve this goal by learning to map input text to output text. However, the input text alone often provides limited knowledge for producing the desired output. To address this issue, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models.
arXiv Detail & Related papers (2020-10-09T06:46:46Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area that aims to create a short, condensed version of an original document. In real-world applications, however, much of the data is not in plain-text format. This paper surveys these new summarization tasks and approaches in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer. The input is a set of structured records and a reference text that describes a different record set. The output is a summary that accurately describes partial content of the source record set in the same writing style as the reference; a toy input/output illustration follows this entry.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
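A toy illustration of the task's two-aspect input and desired output; the records and texts below are invented for illustration only.

```python
# Input aspect 1: structured records to be described.
source_records = [
    {"player": "A. Smith", "points": 28, "rebounds": 12},
    {"player": "B. Jones", "points": 17, "rebounds": 4},
]

# Input aspect 2: a reference text describing a *different* record set;
# only its writing style should be reused.
reference_text = "C. Brown poured in 31 points and grabbed 9 boards."

# Desired output: partial source content rendered in the reference's style.
expected_output = "A. Smith poured in 28 points and grabbed 12 boards."
```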
This list is automatically generated from the titles and abstracts of the papers in this site.