SciDER: Scientific Data-centric End-to-end Researcher
- URL: http://arxiv.org/abs/2603.01421v1
- Date: Mon, 02 Mar 2026 03:53:20 GMT
- Title: SciDER: Scientific Data-centric End-to-end Researcher
- Authors: Ke Lin, Yilin Lu, Shreyas Bhat, Xuehang Guo, Junier Oliva, Qingyun Wang,
- Abstract summary: SciDER is a data-centric end-to-end system that automates the research lifecycle.<n>Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data.<n>We also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research.
- Score: 9.796056249900785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated scientific discovery with large language models is transforming the research lifecycle from ideation to experimentation, yet existing agents struggle to autonomously process raw data collected from scientific experiments. We introduce SciDER, a data-centric end-to-end system that automates the research lifecycle. Unlike traditional frameworks, our specialized agents collaboratively parse and analyze raw scientific data, generate hypotheses and experimental designs grounded in specific data characteristics, and write and execute corresponding code. Evaluation on three benchmarks shows SciDER excels in specialized data-driven scientific discovery and outperforms general-purpose agents and state-of-the-art models through its self-evolving memory and critic-led feedback loop. Distributed as a modular Python package, we also provide easy-to-use PyPI packages with a lightweight web interface to accelerate autonomous, data-driven research and aim to be accessible to all researchers and developers.
Related papers
- WildSci: Advancing Scientific Reasoning from In-the-Wild Literature [50.16160754134139]
We introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature.<n>By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals.<n>Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach.
arXiv Detail & Related papers (2026-01-09T06:35:23Z) - ScienceDB AI: An LLM-Driven Agentic Recommender System for Large-Scale Scientific Data Sharing Services [36.35068691076956]
We present ScienceDB AI, a novel agentic recommender system developed on Science Data Bank (ScienceDB)<n>ScienceDB AI leverages natural language conversations and deep reasoning to accurately recommend datasets aligned with researchers' scientific intents.<n>The Trustworthy RAG provides citable references via Citable Task Record (CSTR) identifiers, enhancing recommendation and trustworthiness.
arXiv Detail & Related papers (2026-01-03T08:42:53Z) - Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents.<n>DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z) - A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research.<n>This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate.<n>We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z) - SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.<n>We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.<n>Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [97.31347312130119]
SciRIFF (Scientific Resource for Instruction-Following and Finetuning) is a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks.<n>These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification.<n> SciRIFF is unique in being entirely expert-written, high-quality instruction-following dataset for extracting and synthesizing information from research literature across diverse scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - Autonomous LLM-driven research from data to human-verifiable research papers [0.0]
We build an automation platform that guides interacting through complete stepwise process.
In mode provided annotated data alone, datapaper raised hypotheses, designed plans, wrote and interpreted analysis codes, generated and interpreted results.
We demonstrate potential for AI-driven acceleration of scientific discovery while enhancing traceability, transparency and verifiability.
arXiv Detail & Related papers (2024-04-24T23:15:49Z) - Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data [21.766339368749872]
We introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline.<n>TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM)<n>These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes.
arXiv Detail & Related papers (2024-02-15T06:30:12Z) - A user-centered approach to designing an experimental laboratory data
platform [0.0]
We take a user-centered approach to understand what essential elements of design and functionality researchers want in an experimental data platform.
We find that having the capability to contextualize rich, complex experimental datasets is the primary user requirement.
arXiv Detail & Related papers (2020-07-28T19:26:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.