SciCoQA: Quality Assurance for Scientific Paper--Code Alignment
- URL: http://arxiv.org/abs/2601.12910v1
- Date: Mon, 19 Jan 2026 10:04:33 GMT
- Title: SciCoQA: Quality Assurance for Scientific Paper--Code Alignment
- Authors: Tim Baumgärtner, Iryna Gurevych
- Abstract summary: We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases. Our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines. The best-performing model in our evaluation, GPT-5, can only detect 45.7% of real-world paper-code discrepancies.
- Score: 53.70401063640645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best-performing model in our evaluation, GPT-5, can only detect 45.7% of real-world paper-code discrepancies.
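To make the task concrete, here is a minimal sketch of how a paper-code discrepancy check might be framed for an LLM. This is illustrative only, not the authors' pipeline: the prompt wording and the `call_llm` helper are assumptions made for this example.

```python
# Illustrative sketch of a paper-code discrepancy check; the prompt format
# and the call_llm() helper are assumptions, NOT the SciCoQA setup.

def build_prompt(paper_excerpt: str, code_snippet: str) -> str:
    """Pair a paper passage with the code that claims to implement it."""
    return (
        "Paper excerpt:\n" + paper_excerpt + "\n\n"
        "Implementation:\n" + code_snippet + "\n\n"
        "Does the code faithfully implement the description? "
        "Answer CONSISTENT or DISCREPANCY, then explain."
    )

def detect_discrepancy(paper_excerpt: str, code_snippet: str, call_llm) -> bool:
    """Return True if the model flags a paper-code mismatch."""
    answer = call_llm(build_prompt(paper_excerpt, code_snippet))
    return answer.strip().upper().startswith("DISCREPANCY")

# Hypothetical instance: the paper says dropout 0.5, the code hard-codes 0.1.
paper = "We apply dropout with probability 0.5 after each attention layer."
code = "self.dropout = nn.Dropout(p=0.1)"
# detect_discrepancy(paper, code, call_llm=my_model)  # -> True if detected
```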
Related papers
- RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension [65.81339691942757]
RPC-Bench is a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts.
arXiv Detail & Related papers (2026-01-14T11:37:00Z) - SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers [16.80818230868491]
This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024. Building on SciReplicate-Bench, we propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions.
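A minimal sketch of how a Paper-Agent / Code-Agent split like the one described might be orchestrated; the two agent callables and the task dataclass are illustrative assumptions, not Sci-Reproducer's actual interfaces.

```python
# Sketch of a dual-agent reproduction loop; interfaces are assumptions,
# not Sci-Reproducer's actual API.
from dataclasses import dataclass

@dataclass
class ReproductionTask:
    paper_text: str      # section describing the algorithm
    repo_context: str    # files the Code Agent may draw dependencies from

def reproduce(task: ReproductionTask, paper_agent, code_agent) -> str:
    # Paper Agent: turn prose into an explicit algorithm specification.
    spec = paper_agent(task.paper_text)
    # Code Agent: implement the spec, grounded in repository context.
    return code_agent(spec, task.repo_context)
```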
arXiv Detail & Related papers (2025-03-31T22:02:24Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
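The general pattern the abstract describes can be sketched as a generate-test-repair loop. The `generate_code` and `generate_tests` callables and the `exec`-based runner below are simplified stand-ins, not UnitCoder's actual implementation.

```python
# Sketch of a unit-test-guided synthesis loop; helpers are stand-ins,
# not UnitCoder's implementation.
import traceback

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Execute candidate code and its unit tests in a shared namespace."""
    namespace: dict = {}
    try:
        exec(code, namespace)    # define the candidate function(s)
        exec(tests, namespace)   # assert-style tests raise on failure
        return True, ""
    except Exception:
        return False, traceback.format_exc()

def synthesize(task: str, generate_code, generate_tests, max_iters: int = 3):
    """Iteratively refine code until the model-generated tests pass."""
    tests = generate_tests(task)          # tests are fixed up front
    feedback = ""
    for _ in range(max_iters):
        code = generate_code(task, feedback)
        ok, feedback = run_tests(code, tests)
        if ok:
            return code
    return None  # give up after max_iters failed attempts
```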
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
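A hypothetical example of what one such annotation might look like; the field and relation names here are assumptions for illustration, not SciER's released schema.

```python
# Hypothetical annotation instance; field names are assumptions,
# not SciER's released schema.
example = {
    "sentence": "We fine-tune BERT on SQuAD for question answering.",
    "entities": [
        {"span": "BERT",               "type": "Method"},
        {"span": "SQuAD",              "type": "Dataset"},
        {"span": "question answering", "type": "Task"},
    ],
    "relations": [
        {"head": "BERT",  "relation": "Trained-On", "tail": "SQuAD"},
        {"head": "SQuAD", "relation": "Used-For",   "tail": "question answering"},
    ],
}
```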
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [97.31347312130119]
SciRIFF (Scientific Resource for Instruction-Following and Finetuning) is a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique in being an entirely expert-written, high-quality instruction-following dataset for extracting and synthesizing information from research literature across diverse scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z) - CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis [33.190021245507445]
Large scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process.
We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments.
We show that our model, leveraging only easily-available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines.
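A sketch of extracting the two input views a CORAL-style model consumes: a linearized AST node sequence plus the surrounding natural-language comments. The joint transformer itself is omitted; this only shows feature extraction using Python's standard `ast` and `tokenize` modules.

```python
# Sketch of AST + comment feature extraction; the downstream weakly
# supervised transformer from the paper is not shown.
import ast
import io
import tokenize

def ast_node_sequence(source: str) -> list:
    """Linearize the abstract syntax tree into a sequence of node types."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

def extract_comments(source: str) -> list:
    """Collect natural-language comments via the tokenizer."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string.lstrip("# ") for tok in tokens
            if tok.type == tokenize.COMMENT]

snippet = "df = df.dropna()  # remove rows with missing values\nmean = df['x'].mean()\n"
print(ast_node_sequence(snippet))  # e.g. ['Module', 'Assign', ...]
print(extract_comments(snippet))   # ['remove rows with missing values']
```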
arXiv Detail & Related papers (2020-08-28T19:57:49Z) - Cascade Neural Ensemble for Identifying Scientifically Sound Articles [0.0]
A barrier to conducting systematic reviews and meta-analysis is efficiently finding scientifically sound relevant articles.
We trained and tested several ensemble architectures of SciBERT on a dataset of about 50K articles from MEDLINE.
The cascade ensemble architecture achieved an F-measure of 0.7505, an impressive 49.1% error-rate reduction.
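A minimal sketch of the cascade idea: each stage forwards only the articles it is unsure about to the next, more expensive, classifier. The classifier callables and the 0.9 confidence threshold are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a classifier cascade; stages and threshold are assumptions,
# not the paper's configuration.
from typing import Callable, Sequence

def cascade_predict(
    text: str,
    stages: Sequence[Callable[[str], float]],  # each returns P(sound)
    threshold: float = 0.9,
) -> bool:
    """Return True if the article is judged scientifically sound."""
    score = 0.5
    for classify in stages:
        score = classify(text)
        # Confident either way: stop early and skip the costlier stages.
        if score >= threshold or score <= 1 - threshold:
            break
    return score >= 0.5
```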
arXiv Detail & Related papers (2020-04-13T22:23:04Z)