WithdrarXiv: A Large-Scale Dataset for Retraction Study
- URL: http://arxiv.org/abs/2412.03775v1
- Date: Wed, 04 Dec 2024 23:36:23 GMT
- Title: WithdrarXiv: A Large-Scale Dataset for Retraction Study
- Authors: Delip Rao, Jonathan Young, Thomas Dietterich, Chris Callison-Burch
- Abstract summary: We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv.
We develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations.
We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96.
- Score: 33.782357627001154
- License:
- Abstract: Retractions play a vital role in maintaining scientific integrity, yet systematic studies of retractions in computer science and other STEM fields remain scarce. We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv, containing over 14,000 papers and their associated retraction comments spanning the repository's entire history through September 2024. Through careful analysis of author comments, we develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations. We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96. Additionally, we release WithdrarXiv-SciFy, an enriched version including scripts for parsed full-text PDFs, specifically designed to enable research in scientific feasibility studies, claim verification, and automated theorem proving. These findings provide valuable insights for improving scientific quality control and automated verification systems. Finally, and most importantly, we discuss ethical issues and take a number of steps to implement responsible data release while fostering open science in this area.
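As a rough illustration of the zero-shot categorization described in the abstract, the sketch below uses an off-the-shelf NLI-based zero-shot classifier and scores it with the same weighted-average F1 metric. This is a minimal sketch under stated assumptions, not the authors' pipeline: the candidate labels are illustrative placeholders (the paper's 10-category taxonomy is not reproduced here), and `facebook/bart-large-mnli` stands in for whatever model the authors actually used.

```python
# Minimal sketch: zero-shot categorization of retraction comments plus
# weighted-F1 evaluation. The label set is illustrative only -- the paper
# defines 10 categories, and its actual model/prompt are not specified here.
from transformers import pipeline
from sklearn.metrics import f1_score

# Illustrative placeholder labels (NOT the paper's full taxonomy).
CANDIDATE_LABELS = [
    "critical error in proof or analysis",
    "duplicate or superseded submission",
    "authorship or policy violation",
    "other",
]

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

def categorize(comment: str) -> str:
    """Assign the highest-scoring candidate label to a retraction comment."""
    result = classifier(comment, candidate_labels=CANDIDATE_LABELS)
    return result["labels"][0]  # labels come back sorted by descending score

# Toy evaluation against a tiny hand-labeled sample (hypothetical data).
comments = [
    "Withdrawn due to a crucial error in Lemma 3.2.",
    "This submission duplicates an earlier arXiv posting and is withdrawn.",
]
gold = [
    "critical error in proof or analysis",
    "duplicate or superseded submission",
]
pred = [categorize(c) for c in comments]
print("weighted F1:", f1_score(gold, pred, average="weighted"))
```

The weighted average mirrors the metric reported in the abstract (per-class F1 weighted by class frequency), which matters here because retraction reasons are unlikely to be evenly distributed across the 10 categories.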
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of the constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - Meta-survey on outlier and anomaly detection [0.0]
This paper presents the first systematic meta-survey of general surveys and reviews on outlier and anomaly detection.
It collects nearly 500 papers using two specialized scientific search engines.
The paper investigates the evolution of the outlier detection field over a 20-year period, revealing emerging themes and methods.
arXiv Detail & Related papers (2023-12-12T09:29:22Z) - Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model with manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - SciFact-Open: Towards open-domain scientific claim verification [61.288725621156864]
We present SciFact-Open, a new test collection designed to evaluate the performance of scientific claim verification systems.
We collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models.
We find that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1.
arXiv Detail & Related papers (2022-10-25T05:45:00Z) - MS2: Multi-Document Summarization of Medical Studies [11.38740406132287]
We release MS2 (Multi-Document Summarization of Medical Studies), a dataset of over 470k documents and 20k summaries derived from the scientific literature.
This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies.
We experiment with a summarization system based on BART, with promising early results; a minimal summarization sketch appears after this list.
arXiv Detail & Related papers (2021-04-13T19:59:34Z) - Accelerating COVID-19 research with graph mining and transformer-based learning [2.493740042317776]
We present AGATHA-C and AGATHA-GP, automated general-purpose hypothesis generation systems for COVID-19 research.
Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) with fast computation times.
We show that the systems are able to discover ongoing research findings, such as the relationship between COVID-19 and the hormone oxytocin.
arXiv Detail & Related papers (2021-02-10T15:11:36Z) - Document Classification for COVID-19 Literature [15.458071120159307]
We provide an analysis of several multi-label document classification models on the LitCovid dataset.
We find that pre-trained language models fine-tuned on this dataset outperform all other baselines.
We also explore 50 errors made by the best-performing models on LitCovid documents.
arXiv Detail & Related papers (2020-06-15T20:03:28Z)
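As referenced in the MS2 entry above, abstractive summarization with a pretrained BART checkpoint can be sketched in a few lines. The snippet below is a hedged illustration only: the `facebook/bart-large-cnn` checkpoint, the naive concatenation of abstracts, and the generation settings are assumptions for demonstration, not the MS2 system or its training setup.

```python
# Minimal sketch of abstractive summarization with a pretrained BART
# checkpoint. This is NOT the MS2 system: checkpoint choice, input
# formatting, and multi-document handling here are simplifying assumptions.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Hypothetical input: concatenate a few study abstracts into one sequence
# (a common, if lossy, way to feed multiple documents to a seq2seq model).
abstracts = [
    "Study A: drug X reduced symptom duration by 2 days in a small trial.",
    "Study B: no significant effect of drug X was observed in 500 patients.",
]
document = " ".join(abstracts)

summary = summarizer(document, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```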