arXivEdits: Understanding the Human Revision Process in Scientific
Writing
- URL: http://arxiv.org/abs/2210.15067v1
- Date: Wed, 26 Oct 2022 22:50:24 GMT
- Title: arXivEdits: Understanding the Human Revision Process in Scientific
Writing
- Authors: Chao Jiang and Wei Xu and Samuel Stevens
- Abstract summary: We provide a complete computational framework for studying text revision in scientific writing.
We first introduce arXivEdits, a new annotated corpus of 751 full papers from arXiv with gold sentence alignment across their multiple revised versions.
It supports our data-driven analysis to unveil the common strategies practiced by researchers for revising their papers.
- Score: 17.63505461444103
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Scientific publications are the primary means to communicate research
discoveries, where the writing quality is of crucial importance. However, prior
work studying the human editing process in this domain mainly focused on the
abstract or introduction sections, resulting in an incomplete picture. In this
work, we provide a complete computational framework for studying text revision
in scientific writing. We first introduce arXivEdits, a new annotated corpus of
751 full papers from arXiv with gold sentence alignment across their multiple
versions of revision, as well as fine-grained span-level edits and their
underlying intentions for 1,000 sentence pairs. It supports our data-driven
analysis to unveil the common strategies practiced by researchers for revising
their papers. To scale up the analysis, we also develop automatic methods to
extract revision at document-, sentence-, and word-levels. A neural CRF
sentence alignment model trained on our corpus achieves 93.8 F1, enabling the
reliable matching of sentences between different versions. We formulate the
edit extraction task as a span alignment problem, and our proposed method
extracts more fine-grained and explainable edits, compared to the commonly used
diff algorithm. An intention classifier trained on our dataset achieves 78.9 F1
on the fine-grained intent classification task. Our data and system are
released at tiny.one/arxivedits.
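The abstract contrasts the paper's span-alignment edit extraction with the "commonly used diff algorithm." As a point of reference, here is a minimal sketch of that word-level diff baseline (not the authors' span-alignment method), using Python's standard difflib:

```python
import difflib

def extract_edits(old: str, new: str):
    """Word-level edit extraction via the standard diff algorithm
    (difflib's longest-matching-subsequence heuristic). A sketch of
    the baseline the paper compares against, not its proposed method."""
    a, b = old.split(), new.split()
    edits = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":  # keep only inserted/deleted/replaced spans
            edits.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits

print(extract_edits("we propose a new method", "we present a novel method"))
```

Because diff matches only identical tokens, it tends to fragment paraphrases into many small replace operations — the coarseness that motivates treating edit extraction as a span alignment problem instead.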
Related papers
- CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions [7.503795054002406]
We propose an original textual resource on the revision step of the writing process of scientific articles.
This new dataset, called CASIMIR, contains the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews.
arXiv Detail & Related papers (2024-03-01T03:07:32Z)
- A Novel Dataset for Non-Destructive Inspection of Handwritten Documents [0.0]
Forensic handwriting examination aims to analyze handwritten documents in order to identify, or form hypotheses about, the manuscript's author.
We propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents written either by the classic "pen and paper" approach (and later digitized) or directly acquired on common devices such as tablets.
Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset.
arXiv Detail & Related papers (2024-01-09T09:25:58Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Scientific Paper Extractive Summarization Enhanced by Citation Graphs [50.19266650000948]
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework.
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available.
arXiv Detail & Related papers (2022-12-08T11:53:12Z)
- Cracking Double-Blind Review: Authorship Attribution with Deep Learning [43.483063713471935]
We propose a transformer-based, neural-network architecture to attribute an anonymous manuscript to an author.
We leverage all research papers publicly available on arXiv amounting to over 2 million manuscripts.
Our method achieves an unprecedented authorship attribution accuracy, where up to 73% of papers are attributed correctly.
arXiv Detail & Related papers (2022-11-14T15:50:24Z)
- Text Revision by On-the-Fly Representation Optimization [76.11035270753757]
Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems.
We present an iterative in-place editing approach for text revision, which requires no parallel data.
It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification.
arXiv Detail & Related papers (2022-04-15T07:38:08Z)
- Paperswithtopic: Topic Identification from Paper Title Only [5.025654873456756]
We present a dataset of papers paired by title and sub-field from the field of artificial intelligence (AI).
We also present results on how to predict a paper's AI sub-field from a given paper title only.
For the transformer models, we also present gradient-based, attention visualizations to further explain the model's classification process.
arXiv Detail & Related papers (2021-10-09T06:32:09Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- Heterogeneous Graph Neural Networks for Extractive Document Summarization [101.17980994606836]
Modeling cross-sentence relations is a crucial step in extractive document summarization.
We present a graph-based neural network for extractive summarization (HeterSumGraph).
We introduce different types of nodes into graph-based neural networks for extractive document summarization.
arXiv Detail & Related papers (2020-04-26T14:38:11Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.