Hidden Division of Labor in Scientific Teams Revealed Through 1.6 Million LaTeX Files
- URL: http://arxiv.org/abs/2502.07263v1
- Date: Tue, 11 Feb 2025 05:07:36 GMT
- Title: Hidden Division of Labor in Scientific Teams Revealed Through 1.6 Million LaTeX Files
- Authors: Jiaxin Pei, Lulin Yang, Lingfei Wu
- Abstract summary: We analyze author-specific macros in files from 1.6 million papers (1991-2023) by 2 million scientists. Using explicit section information, we reveal a hidden division of labor within scientific teams.
- Score: 37.77089168249056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognition of individual contributions is fundamental to the scientific reward system, yet coauthored papers obscure who did what. Traditional proxies (author order and career stage) reinforce biases, while contribution statements remain self-reported and limited to select journals. We construct the first large-scale dataset on writing contributions by analyzing author-specific macros in LaTeX files from 1.6 million papers (1991-2023) by 2 million scientists. Validation against self-reported statements (precision = 0.87), author order patterns, field-specific norms, and Overleaf records (Spearman's rho = 0.6, p < 0.05) confirms the reliability of the created data. Using explicit section information, we reveal a hidden division of labor within scientific teams: some authors primarily contribute to conceptual sections (e.g., Introduction and Discussion), while others focus on technical sections (e.g., Methods and Experiments). These findings provide the first large-scale evidence of implicit labor division in scientific teams, challenging conventional authorship practices and informing institutional policies on credit allocation.
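The method described above rests on a simple observation: coauthors often define personal LaTeX macros (e.g., ones named after their initials) to mark their own edits, and the sections in which those macros appear indicate who wrote what. A minimal sketch of this idea follows; the macro-naming heuristic, function names, and regular expressions are illustrative assumptions, not the authors' actual pipeline.

```python
import re

# Matches macro definitions such as \newcommand{\jp}{...}
MACRO_DEF = re.compile(r"\\newcommand\{\\(\w+)\}")

def author_macros(latex_source, author_initials):
    """Return defined macros whose names match a set of author initials.

    `author_initials` is assumed to be a set of lowercase strings
    derived from the author list (a hypothetical heuristic).
    """
    defined = set(MACRO_DEF.findall(latex_source))
    return {m for m in defined if m.lower() in author_initials}

def macro_usage_by_section(latex_source, macro):
    """Count uses of \\<macro> inside each \\section of the source."""
    counts = {}
    # re.split with a capturing group yields
    # [preamble, title1, body1, title2, body2, ...]
    parts = re.split(r"\\section\{([^}]*)\}", latex_source)
    for title, body in zip(parts[1::2], parts[2::2]):
        counts[title] = len(re.findall(r"\\" + macro + r"\b", body))
    return counts
```

For a toy document where `\jp{...}` appears only in the Introduction, `macro_usage_by_section(src, "jp")` would attribute that section to the matching author and report zero activity elsewhere, which is the kind of per-section signal the paper aggregates at scale.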
Related papers
- PreScience: A Benchmark for Forecasting Scientific Contributions [32.63164451901248]
PreScience is a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks. We develop baselines and evaluations for each task, including LACERScore, a novel measure of contribution similarity. The resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.
arXiv Detail & Related papers (2026-02-24T01:37:53Z) - Measuring the State of Open Science in Transportation Using Large Language Models [8.915048816245394]
Open science initiatives have strengthened scientific integrity and accelerated research progress across many fields. Key features of open science, defined here as data and code availability, are difficult to extract due to the inherent complexity of the field. This paper introduces an automatic and scalable feature-extraction pipeline to measure data and code availability in transportation research.
arXiv Detail & Related papers (2026-01-20T19:39:52Z) - When a Paper Has 1000 Authors: Rethinking Citation Metrics in the Era of LLMs [11.503915439591735]
Author-level citation metrics provide a practical, interpretable, and scalable signal of scholarly influence in a complex research ecosystem. The past five years have seen the rapid emergence of large-scale publications in the field of large language models and foundation models. We propose the SBCI index, analyze its theoretical properties, and evaluate its behavior on synthetic publication datasets.
arXiv Detail & Related papers (2025-08-08T04:18:26Z) - Comprehensive Manuscript Assessment with Text Summarization Using 69707 articles [10.943765373420135]
We harness Scopus to curate a significantly comprehensive and large-scale dataset of information from 69707 scientific articles.
We propose a deep learning methodology for the impact-based classification tasks, which leverages semantic features extracted from the manuscripts and paper metadata.
arXiv Detail & Related papers (2025-03-26T07:56:15Z) - Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z) - Cracking Double-Blind Review: Authorship Attribution with Deep Learning [43.483063713471935]
We propose a transformer-based, neural-network architecture to attribute an anonymous manuscript to an author.
We leverage all research papers publicly available on arXiv amounting to over 2 million manuscripts.
Our method achieves an unprecedented authorship attribution accuracy, where up to 73% of papers are attributed correctly.
arXiv Detail & Related papers (2022-11-14T15:50:24Z) - CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z) - TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics [32.4845534482475]
We present a new corpus that contains domain expert annotations for Task (T), Dataset (D), and Metric (M) entities on 2,000 sentences extracted from NLP papers.
We report experiment results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL.
arXiv Detail & Related papers (2021-01-25T17:54:06Z) - What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.