ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
- URL: http://arxiv.org/abs/2506.09513v1
- Date: Wed, 11 Jun 2025 08:36:55 GMT
- Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu,
- Abstract summary: ReasonMed is the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths. We train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
- Score: 44.96018028534255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
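The verify-and-refine loop described in the abstract can be sketched as a simple control flow. This is a minimal illustration, not the authors' implementation: the agent functions `verify_step` and `refine_step` are hypothetical placeholders standing in for LLM calls (a verifier agent and the Error Refiner), and only the overall loop structure follows the paper's description.

```python
# Hypothetical sketch of the multi-agent verification-and-refinement process:
# a verifier flags error-prone reasoning steps, and an Error Refiner rewrites
# the flagged steps until the path passes verification (or rounds run out).

def verify_step(step: str) -> bool:
    """Verifier agent (stub): return True if the step looks sound.

    In the real pipeline this would be an LLM judging medical correctness;
    here a trivial keyword check stands in for that call.
    """
    return "unsupported" not in step


def refine_step(step: str) -> str:
    """Error Refiner agent (stub): rewrite a flagged step.

    In the real pipeline this would prompt an LLM with the verifier's
    feedback; here we just patch the offending phrase.
    """
    return step.replace("unsupported", "supported")


def refine_path(path: list[str], max_rounds: int = 2) -> list[str]:
    """Run the verifier over each step and refine any flagged ones."""
    path = list(path)  # avoid mutating the caller's list
    for _ in range(max_rounds):
        flagged = [i for i, step in enumerate(path) if not verify_step(step)]
        if not flagged:
            break  # every step passed verification
        for i in flagged:
            path[i] = refine_step(path[i])
    return path
```

Paths that still fail after `max_rounds` would presumably be discarded during distillation, which is consistent with 1.7M initial paths being filtered down to 370k examples.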
Related papers
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z) - Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs [23.50838763761289]
We propose Mentor-Intern Collaborative Search (MICS) to generate rigorous and effective medical chain-of-thought data. Reasoning performance is measured by an MICS-Score, which assesses the quality of generated reasoning paths. We then construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM trained via a curriculum learning strategy.
arXiv Detail & Related papers (2025-06-20T12:51:19Z) - Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards [21.831262938278915]
We introduce Med-PRM, a process reward modeling framework that verifies each reasoning step against established medical knowledge bases. Med-PRM achieves state-of-the-art performance, improving base-model performance by up to 13.50%. We demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat.
arXiv Detail & Related papers (2025-06-13T05:36:30Z) - Disentangling Reasoning and Knowledge in Medical Large Language Models [23.401484250342158]
Medical reasoning in large language models aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), and train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples.
arXiv Detail & Related papers (2025-05-16T17:16:27Z) - ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z) - QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model [15.30318329533069]
Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning. We propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. We demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b.
arXiv Detail & Related papers (2025-04-13T12:32:25Z) - MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs [39.65443626577068]
We introduce MedReason, a high-quality medical reasoning dataset. Our pipeline generates detailed reasoning for various medical questions drawn from 7 medical datasets. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets.
arXiv Detail & Related papers (2025-04-01T17:31:44Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [19.448687758457318]
HuatuoGPT-o1, a medical LLM capable of complex reasoning, outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show that complex reasoning improves medical problem-solving and benefits further from reinforcement learning.
arXiv Detail & Related papers (2024-12-25T15:12:34Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.