Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
- URL: http://arxiv.org/abs/2506.02126v1
- Date: Mon, 02 Jun 2025 18:01:00 GMT
- Title: Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
- Authors: Juncheng Wu, Sheng Liu, Haoqin Tu, Hang Yu, Xiaoke Huang, James Zou, Cihang Xie, Yuyin Zhou
- Abstract summary: This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains. We introduce a fine-grained evaluation framework that judges the correctness of the knowledge used and the quality of the reasoning. Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains.
- Score: 52.86636270242863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in reasoning-enhanced Large Language Models such as OpenAI-o1/3 and DeepSeek-R1 have significantly improved performance on complex tasks. However, the quality and transparency of their internal reasoning processes remain underexplored. This work moves beyond final-answer accuracy and investigates step-by-step reasoning in the medical and mathematical domains by explicitly decomposing the thinking trajectories into two parts: knowledge and reasoning. Specifically, we introduce a fine-grained evaluation framework that judges (1) the correctness of the knowledge used, measured by a Knowledge Index (KI), and (2) the quality of reasoning, measured by Information Gain (InfoGain). Using this framework, we study R1-distilled and base Qwen models trained with supervised fine-tuning (SFT) and/or reinforcement learning (RL) in the medical and math domains. Three intriguing findings emerge: (1) The general reasoning abilities of R1-distilled models do not transfer effectively to the medical domain through either SFT or RL. (2) SFT raises final-answer accuracy in both domains, but often at the cost of reasoning quality: InfoGain drops by 38.9% on average compared with untrained models. In the medical domain, however, SFT remains crucial because domain knowledge is indispensable. (3) RL enhances medical reasoning by pruning inaccurate or irrelevant knowledge from reasoning paths, thereby improving both reasoning accuracy and knowledge correctness.
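To make the two metrics concrete, here is a minimal sketch of how a judge-based scoring of decomposed trajectories could look. The `Step` structure, the judge verdicts, and the prefix-probability reading of InfoGain are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    text: str                 # one decomposed thinking step
    knowledge_correct: bool   # judge's verdict on the facts this step invokes

def knowledge_index(steps: List[Step]) -> float:
    """Fraction of steps whose invoked knowledge is judged correct --
    one plausible reading of the paper's Knowledge Index (KI)."""
    return sum(s.knowledge_correct for s in steps) / len(steps) if steps else 0.0

def info_gain(steps: List[Step],
              p_answer: Callable[[List[str]], float]) -> float:
    """Average per-step increase in the probability a verifier assigns to
    the reference answer given the trajectory prefix -- one plausible way
    to operationalize Information Gain (InfoGain)."""
    prefix: List[str] = []
    prev = p_answer(prefix)   # verifier's belief before any reasoning
    gains = []
    for step in steps:
        prefix.append(step.text)
        cur = p_answer(prefix)
        gains.append(cur - prev)
        prev = cur
    return sum(gains) / len(gains) if gains else 0.0
```

Averaged over a benchmark, with a judge LLM supplying the `knowledge_correct` verdicts and a verifier supplying `p_answer`, these two scores would give trajectory-level KI and InfoGain estimates against which the SFT/RL findings above can be read.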
Related papers
- Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning [31.58210903685538]
We introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. We then adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization.
arXiv Detail & Related papers (2025-07-31T13:31:01Z) - FairReason: Balancing Reasoning and Social Bias in MLLMs [50.618158642714505]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z) - CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making [42.28216499263317]
We introduce Med-Zero-17K, a curated dataset for pure RL-based training that covers over 30 medical image modalities and 24 clinical tasks. We propose a novel large-scale RL framework for Med-VLMs that integrates rewards ensuring fidelity between perception and reasoning, consistency between reasoning and answer, and rule-based accuracy for final responses (a toy sketch of combining such reward terms appears after this list).
arXiv Detail & Related papers (2025-06-15T13:42:46Z) - Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact of reinforcement learning on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z) - BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs [87.24843751412783]
We propose BARREL, a framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%.
arXiv Detail & Related papers (2025-05-18T07:27:34Z) - RARE: Retrieval-Augmented Reasoning Modeling [41.24577920467858]
We propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Experiments demonstrate that lightweight-trained models (e.g., Llama-3.1-8B) can achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and DeepSeek-R1 by up to approximately 20% in accuracy (a minimal sketch of this retrieval-then-reason pattern appears after this list).
arXiv Detail & Related papers (2025-03-30T16:49:44Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs [19.448687758457318]
HuatuoGPT-o1, a medical LLM capable of complex reasoning, outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show that complex reasoning improves medical problem-solving and benefits more from reinforcement learning.
arXiv Detail & Related papers (2024-12-25T15:12:34Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z) - Explainable, Domain-Adaptive, and Federated Artificial Intelligence in Medicine [5.126042819606137]
We focus on three key methodological approaches that address some of the particular challenges in AI-driven medical decision making.
Domain adaptation and transfer learning enable AI models to be trained and applied across multiple domains.
Federated learning enables learning large-scale models without exposing sensitive personal health information.
arXiv Detail & Related papers (2022-11-17T03:32:00Z)
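On the CAPO entry above: a toy sketch, under assumed [0, 1] component scales and arbitrary weights, of how the three reward signals it names might be blended; CAPO's actual reward formulation is not specified here.

```python
def composite_reward(perception_fidelity: float,
                     reasoning_consistency: float,
                     answer_correct: bool,
                     w_fid: float = 0.3,
                     w_cons: float = 0.3,
                     w_acc: float = 0.4) -> float:
    """Weighted blend of the three signals the CAPO summary names:
    perception-reasoning fidelity, reasoning-to-answer consistency,
    and rule-based answer accuracy. Weights and [0, 1] scales are
    illustrative assumptions, not CAPO's actual formulation."""
    return (w_fid * perception_fidelity
            + w_cons * reasoning_consistency
            + w_acc * float(answer_correct))
```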
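On the RARE entry above: a minimal sketch of the retrieval-then-reason pattern its summary describes, where domain knowledge is fetched from an external source and the model contributes only the reasoning. The `retrieve` and `generate` callables and the prompt wording are hypothetical stand-ins, not RARE's interface.

```python
from typing import Callable, List

def rare_style_answer(question: str,
                      retrieve: Callable[[str, int], List[str]],
                      generate: Callable[[str], str],
                      k: int = 3) -> str:
    """Retrieval-augmented reasoning in the style RARE's summary
    describes: knowledge is externalized to a retrievable source,
    so the model only has to reason over it. All interfaces here
    are hypothetical."""
    passages = retrieve(question, k)  # externalized domain knowledge
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = ("Use only the retrieved facts below and reason step by step.\n"
              f"{context}\nQuestion: {question}\nAnswer:")
    return generate(prompt)           # model supplies the reasoning
```

The design point is that `generate` never needs to have memorized the domain facts; they arrive through `retrieve` at inference time.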