ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction
- URL: http://arxiv.org/abs/2501.05222v1
- Date: Thu, 09 Jan 2025 13:19:55 GMT
- Title: ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction
- Authors: LĂ©ane Jourdan, Nicolas Hernandez, Richard Dufour, Florian Boudin, Akiko Aizawa,
- Abstract summary: We explore the impact of shifting from sentence-level to paragraph-level scope for the task of scientific text revision.
The paragraph level definition of the task allows for more meaningful changes, and is guided by detailed revision instructions rather than general ones.
Our experiments demonstrate that using detailed instructions significantly improves the quality of automated revisions compared to general approaches.
- Score: 26.64363135181992
- License:
- Abstract: Revision is a crucial step in scientific writing, where authors refine their work to improve clarity, structure, and academic quality. Existing approaches to automated writing assistance often focus on sentence-level revisions, which fail to capture the broader context needed for effective modification. In this paper, we explore the impact of shifting from sentence-level to paragraph-level scope for the task of scientific text revision. The paragraph level definition of the task allows for more meaningful changes, and is guided by detailed revision instructions rather than general ones. To support this task, we introduce ParaRev, the first dataset of revised scientific paragraphs with an evaluation subset manually annotated with revision instructions. Our experiments demonstrate that using detailed instructions significantly improves the quality of automated revisions compared to general approaches, no matter the model or the metric considered.
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Analysing Zero-Shot Readability-Controlled Sentence Simplification [54.09069745799918]
We investigate how different types of contextual information affect a model's ability to generate sentences with the desired readability.
Results show that all tested models struggle to simplify sentences due to models' limitations and characteristics of the source sentences.
Our experiments also highlight the need for better automatic evaluation metrics tailored to RCTS.
arXiv Detail & Related papers (2024-09-30T12:36:25Z) - Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Revision [62.12545440385489]
We introduce Re3, a framework for joint analysis of collaborative document revision.
We present Re3-Sci, a large corpus of aligned scientific paper revisions manually labeled according to their action and intent.
We use the new data to provide first empirical insights into collaborative document revision in the academic domain.
arXiv Detail & Related papers (2024-05-31T21:19:09Z) - CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions [7.503795054002406]
We propose an original textual resource on the revision step of the writing process of scientific articles.
This new dataset, called CASIMIR, contains the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews.
arXiv Detail & Related papers (2024-03-01T03:07:32Z) - To Revise or Not to Revise: Learning to Detect Improvable Claims for
Argumentative Writing Support [20.905660642919052]
We explore the main challenges to identifying argumentative claims in need of specific revisions.
We propose a new sampling strategy based on revision distance.
We provide evidence that using contextual information and domain knowledge can further improve prediction results.
arXiv Detail & Related papers (2023-05-26T10:19:54Z) - Improving Iterative Text Revision by Learning Where to Edit from Other
Revision Tasks [11.495407637511878]
Iterative text revision improves text quality by fixing grammatical errors, rephrasing for better readability or contextual appropriateness, or reorganizing sentence structures throughout a document.
Most recent research has focused on understanding and classifying different types of edits in the iterative revision process from human-written text.
We aim to build an end-to-end text revision system that can iteratively generate helpful edits by explicitly detecting editable spans with their corresponding edit intents.
arXiv Detail & Related papers (2022-12-02T18:10:43Z) - EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: An instruction-based, benchmark and evaluation suite for automatic evaluation of editing capabilities.
We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z) - Towards Automated Document Revision: Grammatical Error Correction,
Fluency Edits, and Beyond [46.130399041820716]
We introduce a new document-revision corpus, TETRA, where professional editors revised academic papers sampled from the ACL anthology.
We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle.
arXiv Detail & Related papers (2022-05-23T17:37:20Z) - Understanding Iterative Revision from Human-Written Text [10.714872525208385]
IteraTeR is the first large-scale, multi-domain, edit-intention annotated corpus of iteratively revised text.
We better understand the text revision process, making vital connections between edit intentions and writing quality.
arXiv Detail & Related papers (2022-03-08T01:47:42Z) - Hierarchical Bi-Directional Self-Attention Networks for Paper Review
Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: sentence encoder (level one), intra-review encoder (level two) and inter-review encoder (level three)
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.