ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows
- URL: http://arxiv.org/abs/2510.20279v2
- Date: Fri, 24 Oct 2025 03:43:46 GMT
- Title: ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows
- Authors: Penghao Wang, Yuhao Zhou, Mengxuan Wu, Ziheng Qin, Bangyuan Zhu, Shengbin Huang, Xuanlei Zhao, Panpan Zhang, Xiaojiang Peng, Yuzhang Shang, Jianfei Yang, Zheng Zhu, Tianlong Chen, Zhangyang Wang, Kai Wang
- Abstract summary: CS-54k is a high-quality corpus of scientific Q&A pairs in computer science. CS-4k is a benchmark for evaluating AI's ability to assist scientific research. CS-50k is a large-scale training dataset.
- Score: 109.34792911044394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) advance, the ultimate vision for their role in science is emerging: we could build an AI collaborator to effectively assist human beings throughout the entire scientific research process. We refer to this envisioned system as ResearchGPT. Given that scientific research progresses through multiple interdependent phases, achieving this vision requires rigorous benchmarks that evaluate the end-to-end workflow rather than isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of scientific Q&A pairs in computer science, built from 14k CC-licensed papers. It is constructed through a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control to ensure factual grounding. From this unified corpus, we derive two complementary subsets: CS-4k, a carefully curated benchmark for evaluating AI's ability to assist scientific research, and CS-50k, a large-scale training dataset. Extensive experiments demonstrate that CS-4k stratifies state-of-the-art LLMs into distinct capability tiers. Open models trained on CS-50k with supervised training and reinforcement learning demonstrate substantial improvements. Even 7B-scale models, when properly trained, outperform many larger proprietary systems, such as GPT-4.1, GPT-4o, and Gemini 2.5 Pro. This indicates that making AI models better research assistants relies more on domain-aligned training with high-quality data than on pretraining scale or general benchmark performance. We release CS-4k and CS-50k in the hope of fostering AI systems as reliable collaborators in CS research.
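The abstract names the key ingredients of the data pipeline (retrieval over 14k CC-licensed papers, LLM-based Q&A generation, multi-stage quality control) without implementation detail. Below is a minimal sketch of how such a paper-grounded pipeline could be wired together; the lexical retriever, the `generate_qa` stub, and both filter stages are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of a paper-grounded RAG pipeline with multi-stage
# quality control. Every component here is an assumption for exposition;
# the paper does not publish this code.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source_passage: str

def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """Toy lexical retrieval: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(passages, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]

def generate_qa(passage: str) -> QAPair:
    """Stand-in for an LLM call that writes a Q&A pair grounded in the passage."""
    return QAPair(f"What does the paper say about: {passage[:40]}...?", passage, passage)

def grounded(pair: QAPair) -> bool:
    """QC stage 1: reject answers that do not overlap their source passage."""
    a = set(pair.answer.lower().split())
    s = set(pair.source_passage.lower().split())
    return len(a & s) / max(len(a), 1) > 0.5

def well_formed(pair: QAPair) -> bool:
    """QC stage 2: cheap structural checks (question-shaped, non-trivial answer)."""
    return pair.question.endswith("?") and len(pair.answer.split()) >= 5

def build_corpus(papers: dict[str, list[str]], topics: list[str]) -> list[QAPair]:
    corpus = []
    for topic in topics:
        for passages in papers.values():
            for passage in retrieve(topic, passages):
                pair = generate_qa(passage)
                if grounded(pair) and well_formed(pair):  # multi-stage QC
                    corpus.append(pair)
    return corpus

papers = {"2510.20279": ["Multi-stage quality control keeps generated answers factually grounded."]}
print(build_corpus(papers, ["quality control"]))
```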
Related papers
- KARL: Knowledge Agents via Reinforcement Learning [63.627906947205624]
We present a system for training enterprise search agents via reinforcement learning. KARLBench is a multi-capability evaluation suite spanning six distinct search regimes. We show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark.
arXiv Detail & Related papers (2026-03-05T14:30:25Z)
- InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery [138.0404718571971]
We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. We evaluate InternAgent-1.5 on scientific reasoning benchmarks such as GAIA, HLE, GPQA, and FrontierScience.
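The abstract only names the three subsystems; as a rough illustration of how a generation/verification/evolution loop can coordinate over many rounds, here is a toy sketch in which every component is a stand-in (the real system is LLM-driven and agentic):

```python
# Schematic generation / verification / evolution loop. Every component
# below is an invented placeholder, not InternAgent-1.5's actual code.
import random

def generate(seed_ideas: list[str]) -> str:
    """Propose a candidate hypothesis (stand-in for the generation subsystem)."""
    return random.choice(seed_ideas) + " + refinement"

def verify(candidate: str) -> float:
    """Score the candidate (stand-in for the verification subsystem)."""
    return random.random()

def evolve(pool: list[tuple[str, float]], keep: int = 2) -> list[str]:
    """Keep the best candidates as seeds for the next round (evolution subsystem)."""
    return [c for c, _ in sorted(pool, key=lambda x: -x[1])[:keep]]

seeds = ["idea A", "idea B", "idea C"]
for _ in range(3):  # long-horizon discovery as repeated rounds
    scored = [(c, verify(c)) for c in (generate(seeds) for _ in range(4))]
    seeds = evolve(scored)
print(seeds)
```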
arXiv Detail & Related papers (2026-02-09T18:36:06Z)
- APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay [86.01901238059261]
APIGen-MT is a framework that generates verifiable and diverse multi-turn agent data. We train a family of models -- the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks.
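A schematic of what "simulated agent-human interplay" with verification might look like as a data-generation loop; the user simulator, agent, and verifier below are invented placeholders, not APIGen-MT's components:

```python
# Schematic multi-turn data generation via simulated agent-human interplay.
# The user simulator, agent, and verifier are invented placeholders.
import random

def simulated_user(turn: int) -> str:
    """Stand-in for an LLM role-playing a human with a hidden goal."""
    script = ["book a flight", "change the date", "confirm the booking"]
    return script[min(turn, len(script) - 1)]

def agent(message: str) -> dict:
    """Stand-in agent that responds with a (mock) API call."""
    return {"tool": "travel_api", "args": {"request": message}}

def verify(trajectory: list[tuple[str, dict]]) -> bool:
    """Keep only trajectories whose calls reach the intended final state."""
    return any(call["args"]["request"] == "confirm the booking" for _, call in trajectory)

def generate_dialogue(max_turns: int) -> list[tuple[str, dict]] | None:
    trajectory = []
    for t in range(max_turns):
        msg = simulated_user(t)
        trajectory.append((msg, agent(msg)))
    return trajectory if verify(trajectory) else None  # discard unverifiable data

# Dialogues that stop too early never confirm the booking and get filtered out.
dataset = [d for d in (generate_dialogue(random.choice([2, 3])) for _ in range(10)) if d]
print(f"{len(dataset)} verified multi-turn trajectories")
```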
arXiv Detail & Related papers (2025-04-04T17:13:57Z)
- How Well Can AI Build SD Models? [0.0]
We introduce two metrics for evaluating AI-generated causal maps: technical correctness (causal translation) and adherence to instructions (conformance). We tested 11 different LLMs on their ability to perform causal translation as well as conform to user instructions.
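One plausible way to operationalize the two metrics, assuming a causal map is represented as signed directed edges: technical correctness as edge-level F1 against a reference map, and conformance as the share of user-required variables that appear. The representation and formulas here are our assumption, not the paper's definitions.

```python
# Illustrative scoring of an AI-generated causal map against a reference.
# Representing a map as signed directed edges and using edge-level
# precision/recall is an assumption; the paper defines its own metrics.
Edge = tuple[str, str, int]  # (cause, effect, polarity: +1 or -1)

reference = {("price", "demand", -1), ("demand", "revenue", +1)}
generated = {("price", "demand", -1), ("price", "revenue", +1)}

def correctness(gen: set[Edge], ref: set[Edge]) -> float:
    """Edge-level F1: did the model translate the causal structure faithfully?"""
    tp = len(gen & ref)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def conformance(gen: set[Edge], required_vars: set[str]) -> float:
    """Share of user-required variables that actually appear in the map."""
    used = {v for e in gen for v in e[:2]}
    return len(required_vars & used) / len(required_vars)

print(f"correctness (F1): {correctness(generated, reference):.2f}")
print(f"conformance:      {conformance(generated, {'price', 'demand', 'revenue'}):.2f}")
```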
arXiv Detail & Related papers (2025-03-19T14:48:47Z)
- MLGym: A New Framework and Benchmark for Advancing AI Research Agents [51.9387884953294]
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing large language models on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. We evaluate a number of frontier large language models (LLMs), such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro, on our benchmark.
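For readers unfamiliar with the Gym interface the abstract refers to, here is a toy environment in that style, where the action is a hyperparameter choice and the reward is a mock validation score; MLGym's actual tasks and API will differ.

```python
# A toy Gym-style environment in the spirit of MLGym: the agent's action
# is a hyperparameter choice and the reward is a (mock) validation score.
# This interface sketch is ours; MLGym's real tasks and API differ.
import gymnasium as gym
import numpy as np

class ToyMLTaskEnv(gym.Env):
    """One-step episode: pick a learning-rate index, observe a score."""

    def __init__(self):
        self.action_space = gym.spaces.Discrete(4)          # 4 candidate LRs
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=(1,))
        self._lrs = [1e-1, 1e-2, 1e-3, 1e-4]

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(1, dtype=np.float32), {}

    def step(self, action):
        lr = self._lrs[int(action)]
        # Mock "validation accuracy": peaks near lr = 1e-2.
        reward = float(np.exp(-(np.log10(lr) + 2) ** 2))
        obs = np.array([reward], dtype=np.float32)
        return obs, reward, True, False, {}

env = ToyMLTaskEnv()
obs, _ = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(f"reward: {reward:.3f}")
```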
arXiv Detail & Related papers (2025-02-20T12:28:23Z)
- CycleResearcher: Improving Automated Research via Automated Review [37.03497673861402]
This paper explores the possibility of using open-source post-trained large language models (LLMs) as autonomous agents capable of performing the full cycle of automated research and review. To train these models, we develop two new datasets reflecting real-world machine learning research and peer review dynamics. Our results demonstrate that CycleReviewer achieves promising performance, with a 26.89% reduction in mean absolute error (MAE) compared to individual human reviewers in predicting paper scores.
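The headline result is stated in mean absolute error (MAE), i.e. the average magnitude of the gap between predicted and actual paper scores. A quick reference computation with invented numbers:

```python
# Mean absolute error between predicted and actual review scores.
# All score values are invented for illustration.
def mae(predicted: list[float], actual: list[float]) -> float:
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

human_scores   = [6.0, 4.5, 7.0, 5.5]  # ground-truth meta-review scores
model_preds    = [5.5, 5.0, 6.5, 5.5]  # CycleReviewer-style predictions
reviewer_preds = [7.0, 3.5, 6.0, 6.5]  # an individual reviewer's guesses

print(f"model MAE:    {mae(model_preds, human_scores):.2f}")
print(f"reviewer MAE: {mae(reviewer_preds, human_scores):.2f}")
```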
arXiv Detail & Related papers (2024-10-28T08:10:21Z)
- Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education [0.13654846342364302]
This study introduces the first benchmark to assess the performance of seven major Large Language Models (LLMs) on the GMAT.
Our analysis shows that most LLMs outperform human candidates, with GPT-4 Turbo not only outperforming the other models but also surpassing the average scores of graduate students at top business schools.
While AI's promise in education, assessment, and tutoring is clear, challenges remain.
arXiv Detail & Related papers (2024-01-02T03:54:50Z)
- DataComp: In search of the next generation of multimodal datasets [179.79323076587255]
DataComp is a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Our benchmark consists of multiple compute scales spanning four orders of magnitude.
In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet.
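Zero-shot ImageNet accuracy here means classifying images by matching image embeddings against text embeddings of class prompts, with no training on ImageNet labels. A sketch using the open_clip library follows; the checkpoint tag is an assumption (consult open_clip's model zoo for the actual DataComp-1B weights), and the dummy tensor stands in for a real preprocessed image.

```python
# Sketch of zero-shot classification with a CLIP ViT-L/14, as measured in
# DataComp. The pretrained tag below is an assumption; check open_clip's
# model zoo for DataComp-1B weights.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k")  # swap in a DataComp checkpoint
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

class_names = ["golden retriever", "tabby cat", "school bus"]  # stand-in for 1000 ImageNet classes
prompts = tokenizer([f"a photo of a {c}" for c in class_names])

with torch.no_grad():
    text_feats = model.encode_text(prompts)
    text_feats /= text_feats.norm(dim=-1, keepdim=True)
    # In real evaluation: image = preprocess(pil_image).unsqueeze(0)
    image = torch.randn(1, 3, 224, 224)  # dummy tensor in place of a real image
    img_feats = model.encode_image(image)
    img_feats /= img_feats.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feats @ text_feats.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```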
arXiv Detail & Related papers (2023-04-27T11:37:18Z)
- Large-scale learning of generalised representations for speaker recognition [52.978310296712834]
We develop a speaker recognition model to be used in diverse scenarios.
We investigate several new training data configurations combining a few existing datasets.
We find that MFA-Conformer, the model with the least inductive bias, generalises the best.
arXiv Detail & Related papers (2022-10-20T03:08:18Z)