ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
- URL: http://arxiv.org/abs/2502.02904v3
- Date: Mon, 17 Feb 2025 07:37:57 GMT
- Title: ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
- Authors: Linghe Wang, Minhwa Lee, Ross Volkov, Luan Tuyen Chau, Dongyeop Kang,
- Abstract summary: The ScholaWrite dataset is a first-of-its-kind keystroke corpus of an end-to-end scholarly writing process for complete manuscripts.
Our dataset includes LaTeX-based keystroke data from five preprints with nearly 62K total text changes and annotations across 4 months of paper writing.
- Score: 12.170448539143909
- Abstract: Writing is a cognitively demanding task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities. Scholarly writing is particularly complex as it requires authors to coordinate many pieces of multiform knowledge. To fully understand writers' cognitive thought process, one should fully decode the end-to-end writing data (from individual ideas to final manuscript) and understand their complex cognitive mechanisms in scholarly writing. We introduce ScholaWrite dataset, a first-of-its-kind keystroke corpus of an end-to-end scholarly writing process for complete manuscripts, with thorough annotations of cognitive writing intentions behind each keystroke. Our dataset includes LaTeX-based keystroke data from five preprints with nearly 62K total text changes and annotations across 4 months of paper writing. ScholaWrite shows promising usability and applications (e.g., iterative self-writing), demonstrating the importance of collection of end-to-end writing data, rather than the final manuscript, for the development of future writing assistants to support the cognitive thinking process of scientists. Our de-identified data examples and code are available on our project page.
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis [4.660229623034816]
The Nuremberg Letterbooks dataset comprises historical documents from the early 15th century.
The dataset includes 4 books containing 1711 labeled pages written by 10 scribes.
arXiv Detail & Related papers (2024-11-11T17:08:40Z)
- BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z)
- An end-to-end, interactive Deep Learning based Annotation system for cursive and print English handwritten text [0.0]
We present an innovative, complete end-to-end pipeline that annotates offline handwritten manuscripts written in both print and cursive English.
This novel method combines a detection system built upon a state-of-the-art text detection model with a custom-made Deep Learning model for the recognition system.
arXiv Detail & Related papers (2023-04-18T00:24:07Z)
- Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts [7.294418916091011]
We introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data.
Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow.
ManuScript intends to provide a complete picture of the scholarly writing process by capturing the linearity and non-linearity of writing trajectory.
arXiv Detail & Related papers (2023-03-31T20:33:03Z)
- Exploitation and exploration in text evolution. Quantifying planning and translation flows during writing [0.13108652488669734]
We introduce measures to quantify subcycles of planning (exploration) and translation (exploitation) during the writing process.
This dataset comes from a series of writing workshops in which, through innovative versioning software, we were able to record all the steps in the construction of a text.
arXiv Detail & Related papers (2023-02-07T17:52:33Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Effidit: Your AI Writing Assistant [60.588370965898534]
Effidit is a digital writing assistant that helps users write higher-quality text more efficiently by using artificial intelligence (AI) technologies.
In Effidit, we significantly expand the capacities of a writing assistant by providing functions in five categories: text completion, error checking, text polishing, keywords to sentences (K2S), and cloud input methods (cloud IME).
arXiv Detail & Related papers (2022-08-03T02:24:45Z)
- CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities [92.79451009324268]
We present CoAuthor, a dataset designed for revealing GPT-3's capabilities in assisting creative and argumentative writing.
We demonstrate that CoAuthor can address questions about GPT-3's language, ideation, and collaboration capabilities.
We discuss how this work may facilitate a more principled discussion around LMs' promises and pitfalls in relation to interaction design.
arXiv Detail & Related papers (2022-01-18T07:51:57Z)
- Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues.
A main challenge is that a person often writes a letter in different styles from time to time.
We refer to this problem as the variance of online writing styles (Var-O-Styles).
arXiv Detail & Related papers (2021-12-06T07:21:53Z)
- Characterizing Stage-Aware Writing Assistance in Collaborative Document Authoring [14.512030721220437]
We present three studies that explore temporal stages of document authoring.
We conclude that writers do in fact conceptually progress through several distinct phases while authoring documents.
As a first step towards facilitating an intelligent digital writing assistant, we conduct a preliminary investigation into the utility of user interaction log data for predicting the temporal stage of a document.
arXiv Detail & Related papers (2020-08-18T21:48:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.