ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase
Generation
- URL: http://arxiv.org/abs/2101.08382v2
- Date: Fri, 5 Feb 2021 14:01:05 GMT
- Title: ParaSCI: A Large Scientific Paraphrase Dataset for Longer Paraphrase
Generation
- Authors: Qingxiu Dong, Xiaojun Wan, Yue Cao
- Abstract summary: ParaSCI is the first large-scale paraphrase dataset in the scientific field.
This dataset includes 33,981 paraphrase pairs from ACL (ParaSCI-ACL) and 316,063 pairs from arXiv (ParaSCI-arXiv)
- Score: 78.10924968931249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose ParaSCI, the first large-scale paraphrase dataset in the
scientific field, including 33,981 paraphrase pairs from ACL (ParaSCI-ACL) and
316,063 pairs from arXiv (ParaSCI-arXiv). Digging into characteristics and
common patterns of scientific papers, we construct this dataset through
intra-paper and inter-paper methods, such as collecting citations to the same
paper or aggregating definitions of scientific terms. To take advantage of
partially paraphrased sentences, we propose PDBERT as a general paraphrase
discovery method. The major advantages of the paraphrases in ParaSCI are their
greater length and textual diversity, which makes the dataset complementary to
existing paraphrase datasets. ParaSCI obtains satisfactory results on human evaluation
and downstream tasks, especially long paraphrase generation.
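The abstract's intra-/inter-paper construction idea (e.g., pairing citation sentences that refer to the same paper) can be illustrated with a small sketch. The function name and toy data below are hypothetical, not taken from the ParaSCI codebase, and the real pipeline additionally filters candidates (e.g., with PDBERT), which is omitted here.

```python
from itertools import combinations

def citation_paraphrase_candidates(citing_sentences):
    """Group citation sentences by cited-paper ID and pair sentences
    that cite the same paper, yielding candidate paraphrase pairs.
    (Hypothetical helper for illustration; the actual ParaSCI
    pipeline further filters these candidates.)"""
    by_paper = {}
    for cited_id, sentence in citing_sentences:
        by_paper.setdefault(cited_id, []).append(sentence)
    pairs = []
    for sentences in by_paper.values():
        pairs.extend(combinations(sentences, 2))
    return pairs

# Toy input: (cited paper ID, citing sentence)
toy = [
    ("P1", "Smith et al. introduced a transformer-based parser."),
    ("P1", "A transformer-based parser was proposed by Smith et al."),
    ("P2", "Lee (2019) studies low-resource translation."),
]
print(citation_paraphrase_candidates(toy))
```

Only the two sentences citing the same paper (P1) are paired; the lone P2 sentence produces no candidate.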
Related papers
- MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs)
We show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z)
- ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation [59.91139600152296]
ParaAMR is a large-scale syntactically diverse paraphrase dataset created by abstract meaning representation back-translation.
We show that ParaAMR can be used to improve on three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning.
arXiv Detail & Related papers (2023-05-26T02:27:33Z)
- LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents [48.84086818702328]
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval.
The vast majority of benchmark datasets for this task come from the scientific domain and contain only the document title and abstract.
This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are found beyond the limited context of the title and abstract.
arXiv Detail & Related papers (2022-03-29T08:44:57Z)
- SciNLI: A Corpus for Natural Language Inference on Scientific Text [47.293189105900524]
We introduce SciNLI, a large dataset for NLI that captures the formality in scientific text.
Our best performing model with XLNet achieves a Macro F1 score of only 78.18% and an accuracy of 78.23%.
arXiv Detail & Related papers (2022-03-13T18:23:37Z)
- Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature [1.4190701053683017]
We propose a natural language processing paradigm to support the human task of identifying informal mentions of research datasets.
The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research.
arXiv Detail & Related papers (2022-03-10T02:11:30Z)
- Semantic Search as Extractive Paraphrase Span Detection [0.8137055256093007]
We approach semantic search by framing the search task as paraphrase span detection.
On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, we find that our paraphrase span detection model outperforms two strong retrieval baselines.
We introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources are not available.
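The back-translation idea described above can be sketched as a round trip through a pivot language: translate a sentence out and back, and keep the (usually non-identical) result as a paraphrase candidate. The `translate` function below is a stand-in toy word map, labeled hypothetical; a real pipeline would call a neural MT model in both directions.

```python
def translate(sentence, table):
    """Toy word-level 'translation' via a lookup table (hypothetical
    stand-in for a neural MT model)."""
    return " ".join(table.get(w, w) for w in sentence.split())

# Hypothetical en->pivot and pivot->en tables; the round trip need
# not reproduce the input exactly, which is what yields a paraphrase.
EN_PIVOT = {"movie": "elokuva", "great": "hieno"}
PIVOT_EN = {"elokuva": "film", "hieno": "excellent"}

def back_translate(sentence):
    """Round-trip a sentence through the pivot language to obtain
    a paraphrase candidate."""
    pivot = translate(sentence, EN_PIVOT)
    return translate(pivot, PIVOT_EN)

print(back_translate("the movie was great"))  # -> "the film was excellent"
```

With a real MT model, lexical and syntactic choices differ across the round trip, producing natural paraphrase pairs without manual annotation.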
arXiv Detail & Related papers (2021-12-09T13:16:42Z)
- Informational Space of Meaning for Scientific Texts [68.8204255655161]
We introduce the Meaning Space, in which the meaning of a word is represented by a vector of Relative Information Gain (RIG) about the subject categories that the text belongs to.
This approach is applied to construct the Meaning Space from the Leicester Scientific Corpus (LSC) and the Leicester Scientific Dictionary-Core (LScDC).
The most informative words are presented for 252 categories. The proposed RIG-based model is shown to be able to highlight topic-specific words within categories.
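As a rough sketch of the RIG idea, the snippet below computes the information gain of a word's presence about category membership, normalized by the category's entropy. This is an assumed formulation for illustration only; the paper defines its own exact RIG measure, and the toy documents are hypothetical.

```python
from math import log2

def entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def relative_information_gain(docs, word, category):
    """Information gain of the word's presence about category
    membership, divided by the category's entropy (assumed RIG
    formulation; see the paper for the exact definition)."""
    n = len(docs)
    p_c = sum(category in d["categories"] for d in docs) / n
    h_c = entropy(p_c)
    with_w = [d for d in docs if word in d["words"]]
    without_w = [d for d in docs if word not in d["words"]]
    cond = 0.0
    for split in (with_w, without_w):
        if split:
            p = sum(category in d["categories"] for d in split) / len(split)
            cond += len(split) / n * entropy(p)
    gain = h_c - cond
    return gain / h_c if h_c else 0.0

# Toy corpus: each document has a word set and category labels.
docs = [
    {"words": {"quark", "boson"},  "categories": {"physics"}},
    {"words": {"quark", "field"},  "categories": {"physics"}},
    {"words": {"gene", "cell"},    "categories": {"biology"}},
    {"words": {"cell", "protein"}, "categories": {"biology"}},
]
print(relative_information_gain(docs, "quark", "physics"))  # 1.0
```

Here "quark" appears in exactly the physics documents, so it carries all of the category's information (RIG 1.0), while a word like "field" that appears in only one physics document scores strictly between 0 and 1.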
arXiv Detail & Related papers (2020-04-28T14:26:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.