HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
- URL: http://arxiv.org/abs/2412.18925v1
- Date: Wed, 25 Dec 2024 15:12:34 GMT
- Title: HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
- Authors: Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang
- Abstract summary: HuatuoGPT-o1, a medical LLM capable of complex reasoning, outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show complex reasoning improves medical problem-solving and benefits more from reinforcement learning.
- Score: 19.448687758457318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike those in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.
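The two-stage approach described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_verifier` is a hypothetical string-matching stand-in for the paper's medical verifier, `scripted_model` is a toy replacement for an LLM, and the binary reward values are placeholders rather than the authors' actual reward design.

```python
from typing import Callable, List, Optional, Tuple

# A proposer maps (question, previous reasoning steps) -> (new reasoning, answer).
Proposer = Callable[[str, List[str]], Tuple[str, str]]


def toy_verifier(answer: str, reference: str) -> bool:
    """Hypothetical verifier: case-insensitive exact match against the reference."""
    return answer.strip().lower() == reference.strip().lower()


def search_trajectory(
    propose: Proposer, question: str, reference: str, max_attempts: int = 3
) -> Optional[List[str]]:
    """Stage 1 sketch: retry until the verifier accepts the answer.

    Returns the accumulated reasoning steps (usable as fine-tuning data),
    or None if no attempt passes verification.
    """
    steps: List[str] = []
    for _ in range(max_attempts):
        reasoning, answer = propose(question, steps)
        steps.append(reasoning)
        if toy_verifier(answer, reference):
            return steps
    return None


def verifier_reward(answer: str, reference: str) -> float:
    """Stage 2 sketch: a verifier-based reward for RL (binary values assumed)."""
    return 1.0 if toy_verifier(answer, reference) else 0.0


def scripted_model(question: str, history: List[str]) -> Tuple[str, str]:
    """Toy proposer: wrong on the first attempt, correct after one revision."""
    if not history:
        return ("Initial guess based on the symptoms alone.", "Common cold")
    return ("Revised after weighing the imaging finding.", "Pneumonia")


trajectory = search_trajectory(
    scripted_model, "Fever, cough, lobar consolidation?", "pneumonia"
)
```

In this toy run the first answer is rejected and the second is accepted, so `trajectory` holds both reasoning steps. The same verifier then scores sampled answers during the RL stage via `verifier_reward`.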
Related papers
- Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs [23.50838763761289]
We propose Mentor-Intern Collaborative Search (MICS) to generate rigorous and effective medical chain-of-thought data. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy.
arXiv Detail & Related papers (2025-06-20T12:51:19Z) - Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning [20.878972841860975]
We present Med-U1, a unified framework for robust reasoning across medical Question-Answering (QA) tasks. With multi-objective reward optimization, Med-U1 directs LLMs to produce concise and verifiable reasoning chains. Empirical results reveal that Med-U1 significantly improves performance across multiple challenging Med-QA benchmarks.
arXiv Detail & Related papers (2025-06-14T02:00:36Z) - ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [44.96018028534255]
ReasonMed is the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths. We train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
arXiv Detail & Related papers (2025-06-11T08:36:55Z) - Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z) - Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains [52.86636270242863]
This work moves beyond the final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains. We introduce a fine-grained evaluation framework that judges the correctness of knowledge used and the quality of reasoning. Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains.
arXiv Detail & Related papers (2025-06-02T18:01:00Z) - Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA [17.823588070044217]
We propose Discuss-RAG, a plug-and-play module designed to enhance medical question answering systems.
Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content.
Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG.
arXiv Detail & Related papers (2025-04-30T01:37:44Z) - Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models [6.176432104264649]
Vision-language models (VLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored.
We propose Med-R1, a reinforcement learning (RL)-enhanced vision-language model designed to improve generalization and reliability in medical reasoning.
We evaluate Med-R1 across eight distinct medical imaging modalities.
arXiv Detail & Related papers (2025-03-18T06:12:38Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs [86.79757571440082]
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks.
We identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts.
We propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts.
arXiv Detail & Related papers (2025-01-30T18:58:18Z) - LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models [18.6994780408699]
Large Language Models (LLMs) face significant challenges in medical question answering.
We propose a novel approach incorporating similar case generation within a multi-agent medical question-answering system.
Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data.
arXiv Detail & Related papers (2024-12-31T19:55:45Z) - Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models [0.0]
Despite the current ubiquity of Large Language Models (LLMs) across the medical domain, there is a surprising lack of studies which address their reasoning behaviour. We emphasise the importance of understanding reasoning behaviour as opposed to high-level prediction accuracies, since it is equivalent to explainable AI (XAI) in this context.
arXiv Detail & Related papers (2024-12-20T10:06:52Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - From Beginner to Expert: Modeling Medical Knowledge into General LLMs [22.475129648458136]
Large language model (LLM) based artificial intelligence (AI) systems have demonstrated remarkable capabilities in natural language understanding and generation.
These models face a significant challenge when it comes to sensitive applications, such as reasoning over medical knowledge and answering medical questions in a physician-like manner.
In this work, we start from a pre-trained general LLM (AntGLM-10B) and fine-tune it from a medical beginner towards a medical expert (called AntGLM-Med-10B).
arXiv Detail & Related papers (2023-12-02T05:54:06Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks are a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z) - Can large language models reason about medical questions? [7.95779617839642]
We investigate whether closed- and open-source models can be applied to answer and reason about difficult real-world-based questions.
We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios.
Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason and recall expert knowledge.
arXiv Detail & Related papers (2022-07-17T11:24:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.