Toward Scientific Reasoning in LLMs: Training from Expert Discussions via Reinforcement Learning
- URL: http://arxiv.org/abs/2505.19501v2
- Date: Mon, 02 Jun 2025 21:31:08 GMT
- Title: Toward Scientific Reasoning in LLMs: Training from Expert Discussions via Reinforcement Learning
- Authors: Ming Yin, Yuanhao Qu, Ling Yang, Le Cong, Mengdi Wang
- Abstract summary: We introduce Genome-Bench, a new benchmark constructed from over a decade of scientific forum discussions on genome engineering. Our pipeline transforms raw interactions into a reinforcement-learning-friendly multiple-choice question format, supported by 3000+ high-quality question-answer pairs. Our results show that reinforcement learning from scientific discussions improves model performance by over 15% compared to the base model on Genome-Bench.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate how to teach large language models (LLMs) to perform scientific reasoning by leveraging expert discussions as a learning signal. Focusing on the genomics domain, we develop an automated pipeline to extract trainable data and introduce Genome-Bench, a new benchmark constructed from over a decade of scientific forum discussions on genome engineering. Our pipeline transforms raw interactions into a reinforcement-learning-friendly multiple-choice question format, supported by 3000+ high-quality question-answer pairs spanning foundational biology, experimental troubleshooting, tool usage, and beyond. We fine-tune an LLM using RL with a rule-based reward signal derived from the synthetic MCQ dataset to enhance domain-specific reasoning. Our results show that reinforcement learning from scientific discussions improves model performance by over 15% compared to the base model on Genome-Bench, narrowing the gap between open-source LLMs and expert-level reasoning. To our knowledge, this is the first end-to-end pipeline for teaching LLMs to reason from scientific discussions, with promising potential for generalization across scientific domains beyond biology.
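As a concrete illustration of the rule-based reward described in the abstract, the following is a minimal sketch, not the authors' released implementation: it parses a model completion for its multiple-choice option letter and returns a binary reward suitable for RL fine-tuning. The answer format, the A-E option range, and all function names are illustrative assumptions.

```python
import re

# Minimal sketch of a rule-based MCQ reward for RL fine-tuning, in the
# spirit of the paper's description. The answer format, option range
# (A-E), and function names are assumptions, not the authors' code.

_CHOICE = re.compile(r"\b([A-E])\b")  # standalone option letter

def extract_choice(completion: str) -> str | None:
    """Return the last standalone option letter in the model output, if any."""
    matches = _CHOICE.findall(completion.upper())
    return matches[-1] if matches else None

def mcq_reward(completion: str, answer_key: str) -> float:
    """Binary reward: 1.0 if the parsed choice matches the key, else 0.0."""
    choice = extract_choice(completion)
    return 1.0 if choice == answer_key.strip().upper() else 0.0

if __name__ == "__main__":
    print(mcq_reward("Reasoning: ... so the answer is C.", "C"))  # 1.0
    print(mcq_reward("I would go with option B.", "C"))           # 0.0
```

Taking the last standalone letter favors the model's final stated answer over incidental letters earlier in its reasoning; in an RL loop such as PPO- or GRPO-style fine-tuning, each sampled completion would be scored this way against the question's answer key.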
Related papers
- Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO
Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) only refine the reasoning path without improving reasoning capacity in math tasks. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE.
arXiv Detail & Related papers (2025-05-28T07:47:46Z)
- General-Reasoner: Advancing LLM Reasoning Across All Domains
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets covering domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z)
- Towards Artificial Intelligence Research Assistant for Expert-Involved Learning
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research. We present ARIEL, an ARtificial Intelligence research assistant for Expert-involved Learning.
arXiv Detail & Related papers (2025-05-03T14:21:48Z)
- Large Language Models for Zero-shot Inference of Causal Structures in Biology
We present a framework to evaluate large language models (LLMs) for zero-shot inference of causal relationships in biology. We systematically evaluate causal claims obtained from an LLM using real-world interventional data. Our results show that even relatively small LLMs can capture meaningful aspects of causal structure in biological systems.
arXiv Detail & Related papers (2025-03-06T11:43:30Z)
- Biological Sequence with Language Model Prompting: A Survey
Large Language Models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. This paper systematically investigates the application of prompt-based methods with LLMs to biological sequences.
arXiv Detail & Related papers (2025-03-06T06:28:36Z)
- BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning
We introduce BioMaze, a dataset with 5.1K complex pathway problems from real research. Our evaluation of methods such as chain-of-thought (CoT) prompting and graph-augmented reasoning shows that LLMs struggle with pathway reasoning. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation.
arXiv Detail & Related papers (2025-02-23T17:38:10Z)
- Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs
We introduce a novel benchmark to evaluate Large Language Models (LLMs) for scientific discovery in both the natural and social sciences. Our benchmark is based on the principles of causal graph discovery: it challenges models to uncover hidden structures and make optimal decisions, including generating valid justifications. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as problem complexity increases.
arXiv Detail & Related papers (2025-02-21T05:35:20Z)
- Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
We introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences. This dataset can bridge the gap between large language models (LLMs) and complex tasks involving biological sequences. We also develop a strong baseline, ChatMultiOmics, with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z)
- A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis
This paper reviews major recent developments in Transformer-based models in the context of nucleotide sequences. We believe this review will help the scientific community understand the various applications of Transformer-based language models to nucleotide sequences.
arXiv Detail & Related papers (2024-12-10T05:33:09Z)
- Multimodal large language model for wheat breeding: a new exploration of smart breeding
The multidisciplinary nature of breeding has brought technical barriers and efficiency challenges to knowledge mining.
This study used supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) technologies to inject cross-domain knowledge into MLLMs.
The WBLM can generate professional decision support answers for phenotyping estimation, environmental stress assessment, target germplasm screening, cultivation technique recommendation, and seed price query tasks.
arXiv Detail & Related papers (2024-11-20T04:47:42Z)
- Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models
Large language models (LLMs) can identify novel research directions by analyzing existing knowledge.
LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect.
We propose KG-CoI, a system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs.
arXiv Detail & Related papers (2024-11-04T18:50:00Z)
- LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
- To Transformers and Beyond: Large Language Models for the Genome
This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics.
Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore both the strengths and limitations of transformers.
We contemplate the future of genomic modeling beyond the transformer architecture based on current trends in research.
arXiv Detail & Related papers (2023-11-13T02:13:58Z)
- ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that current LLMs fall short of delivering satisfactory performance, with a best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
- ScienceWorld: Is your Agent Smarter than a 5th Grader?
This paper presents a new benchmark, ScienceWorld, to test agents' scientific reasoning abilities.
Current state-of-the-art models are unable to reason about or explain learned science concepts in novel contexts.
arXiv Detail & Related papers (2022-03-14T22:52:34Z)
- SciFive: a text-to-text transformer model for biomedical literature
We introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora.
Our results support the exploration of more difficult text generation tasks and the development of new methods in this area.
arXiv Detail & Related papers (2021-05-28T06:09:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.