DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models
- URL: http://arxiv.org/abs/2506.01257v1
- Date: Mon, 02 Jun 2025 02:17:04 GMT
- Title: DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models
- Authors: Jiancheng Ye, Sophie Bronstein, Jiarui Hai, Malak Abu Hashish
- Abstract summary: DeepSeek-R1 is a cutting-edge open-source large language model (LLM) developed by DeepSeek. Released under the permissive MIT license, DeepSeek-R1 offers a transparent and cost-effective alternative to proprietary models. It excels in structured problem-solving domains such as mathematics, healthcare diagnostics, code generation, and pharmaceutical research.
- Score: 4.506083131558209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DeepSeek-R1 is a cutting-edge open-source large language model (LLM) developed by DeepSeek, showcasing advanced reasoning capabilities through a hybrid architecture that integrates mixture of experts (MoE), chain of thought (CoT) reasoning, and reinforcement learning. Released under the permissive MIT license, DeepSeek-R1 offers a transparent and cost-effective alternative to proprietary models like GPT-4o and Claude-3 Opus; it excels in structured problem-solving domains such as mathematics, healthcare diagnostics, code generation, and pharmaceutical research. The model demonstrates competitive performance on benchmarks like the United States Medical Licensing Examination (USMLE) and American Invitational Mathematics Examination (AIME), with strong results in pediatric and ophthalmologic clinical decision support tasks. Its architecture enables efficient inference while preserving reasoning depth, making it suitable for deployment in resource-constrained settings. However, DeepSeek-R1 also exhibits increased vulnerability to bias, misinformation, adversarial manipulation, and safety failures - especially in multilingual and ethically sensitive contexts. This survey highlights the model's strengths, including interpretability, scalability, and adaptability, alongside its limitations in general language fluency and safety alignment. Future research priorities include improving bias mitigation, natural language comprehension, domain-specific validation, and regulatory compliance. Overall, DeepSeek-R1 represents a major advance in open, scalable AI, underscoring the need for collaborative governance to ensure responsible and equitable deployment.
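The survey characterizes the architecture only at a high level. As a rough illustration of the mixture-of-experts component, below is a minimal top-k routing sketch in which each token is processed by only a few experts; the expert count, dimensions, and gating scheme are illustrative assumptions, not DeepSeek-R1's actual design.

```python
# Minimal top-k mixture-of-experts routing sketch (illustrative only;
# dimensions, expert count, and gating are assumptions, not DeepSeek-R1's).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:        (tokens, d_model) token representations
    gate_w:   (d_model, n_experts) gating weights
    experts:  list of callables, each mapping (d_model,) -> (d_model,)
    """
    scores = softmax(x @ gate_w)                  # (tokens, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -k:]    # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        weights = scores[t, topk[t]]
        weights /= weights.sum()                  # renormalize over chosen experts
        for w, e_idx in zip(weights, topk[t]):
            out[t] += w * experts[e_idx](x[t])    # sparse weighted combination
    return out

# Toy usage: 4 tokens, d_model=8, 4 linear "experts".
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [(lambda W: (lambda h: h @ W))(rng.normal(size=(d, d)) * 0.1)
           for _ in range(n_experts)]
x = rng.normal(size=(4, d))
y = moe_forward(x, rng.normal(size=(d, n_experts)), experts)
print(y.shape)  # (4, 8): each token touched only 2 of the 4 experts
```

Sparse routing of this kind is what lets an MoE model keep inference cost far below a dense model of the same total parameter count, which is the efficiency property the abstract attributes to DeepSeek-R1.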
Related papers
- Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? [2.010294990327175]
Current AI evaluation practices depend heavily on established benchmarks. These tools, however, were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap".
arXiv Detail & Related papers (2025-08-07T15:03:39Z)
- Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field and proposes a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z)
- Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench [0.0]
HealthBench is a benchmark designed to better measure the capabilities of AI systems for health. Its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies. We propose anchoring reward functions in version-controlled Clinical Practice Guidelines that incorporate systematic reviews and GRADE evidence ratings, as sketched below.
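Taken only from the abstract above, here is a hypothetical sketch of such a guideline-anchored reward; the `GuidelineEntry` fields, GRADE weights, and substring-matching rule are illustrative stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of a guideline-anchored reward function.
# Fields, weights, and the matching rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class GuidelineEntry:
    version: str          # pinned CPG release, for auditability
    recommendation: str   # canonical recommended action
    grade: str            # GRADE certainty: "high" ... "very_low"

# Higher-certainty evidence contributes more reward signal.
GRADE_WEIGHT = {"high": 1.0, "moderate": 0.7, "low": 0.4, "very_low": 0.1}

def guideline_reward(model_answer: str, entries: list[GuidelineEntry]) -> float:
    """Score an answer by agreement with versioned guideline recommendations,
    weighted by GRADE certainty (naive substring matching stands in for a
    real clinical entailment check)."""
    if not entries:
        return 0.0
    matched = sum(GRADE_WEIGHT[e.grade] for e in entries
                  if e.recommendation.lower() in model_answer.lower())
    return matched / sum(GRADE_WEIGHT[e.grade] for e in entries)

entries = [
    GuidelineEntry("CPG-2024.2", "start first-line antihypertensive therapy", "high"),
    GuidelineEntry("CPG-2024.2", "order renal function panel", "moderate"),
]
print(guideline_reward("Order renal function panel and recheck in 2 weeks.", entries))
```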
arXiv Detail & Related papers (2025-07-31T18:16:10Z)
- Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making [80.94208848596215]
We present a new concept called the Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the "catfish effect" in organizational psychology, the Catfish Agent challenges emerging consensus to stimulate deeper reasoning.
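The mechanism is described only conceptually in the abstract; a minimal sketch of what injecting structured dissent into a multi-agent deliberation loop could look like, where the agent interface and prompt wording are assumptions rather than the paper's code:

```python
# Illustrative sketch of a dissent-injecting "catfish" agent in a
# multi-agent consensus loop; interface and prompts are assumptions.
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out model call

def deliberate(case: str, clinicians: list[LLM], catfish: LLM, rounds: int = 2) -> list[str]:
    """Run a panel discussion; after each round the catfish agent is prompted
    to challenge the emerging consensus rather than agree with it."""
    opinions = [agent(f"Case: {case}\nGive your diagnosis and reasoning.")
                for agent in clinicians]
    for _ in range(rounds):
        consensus = "\n".join(opinions)
        challenge = catfish(
            f"Case: {case}\nPanel opinions:\n{consensus}\n"
            "Identify the weakest assumption above and argue an alternative diagnosis."
        )
        # Panel members revise in light of the structured dissent.
        opinions = [agent(f"Case: {case}\nPrior opinions:\n{consensus}\n"
                          f"Dissent: {challenge}\nRevise or defend your answer.")
                    for agent in clinicians]
    return opinions

# Toy stand-ins for real model calls:
panel = [lambda p: "Likely viral pneumonia." for _ in range(3)]
catfish = lambda p: "Chest findings could also fit pulmonary embolism; check D-dimer?"
print(deliberate("58F, dyspnea, pleuritic pain", panel, catfish))
```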
arXiv Detail & Related papers (2025-05-27T17:59:50Z)
- Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies [11.0505830548286]
This study introduces a granular error taxonomy through systematic analysis of the top 10 models on MedBench. Evaluation of these 10 leading models reveals vulnerabilities despite their 0.86 accuracy in medical knowledge recall: the analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning.
arXiv Detail & Related papers (2025-03-10T13:28:25Z)
- Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. Remaining shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance, necessitate advanced post-training language models (PoLMs). This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms, including Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; and Integration and Adaptation.
arXiv Detail & Related papers (2025-03-08T05:41:42Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references, and propose a framework encompassing three critical stages (examination recommendation, diagnostic decision-making, and treatment planning) that simulates the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
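A rough sketch of such a three-stage evaluation loop, assuming hypothetical case fields and a naive exact-match score in place of MedR-Bench's actual metrics:

```python
# Illustrative three-stage evaluation loop for the framework described above;
# case fields and the scoring rule are assumptions, not MedR-Bench's.
def evaluate_case(case: dict, llm) -> dict:
    """Score a model through the examination -> diagnosis -> treatment
    pipeline, comparing each step against the case's reasoning reference."""
    results = {}
    for stage in ("examination_recommendation", "diagnosis", "treatment_plan"):
        answer = llm(f"Stage: {stage}\nPresentation: {case['presentation']}")
        # Naive exact-match stand-in for the benchmark's real metrics.
        results[stage] = float(case["reference"][stage].lower() in answer.lower())
    return results

case = {
    "presentation": "45M, 3 days of fever, productive cough, right basal crackles",
    "reference": {
        "examination_recommendation": "chest x-ray",
        "diagnosis": "community-acquired pneumonia",
        "treatment_plan": "empirical oral amoxicillin",
    },
}
print(evaluate_case(case, lambda p: "Order a chest x-ray; likely community-acquired "
                                    "pneumonia; start empirical oral amoxicillin."))
```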
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- Generalization of Medical Large Language Models through Cross-Domain Weak Supervision [0.0]
We propose the Incremental Curriculum-Based Fine-Tuning (ICFT) framework to enhance the generative capabilities of medical large language models (MLLMs). ICFT combines curriculum-based learning, dual-stage memory coordination, and parameter-efficient fine-tuning to enable a progressive transition from general linguistic knowledge to strong domain-specific expertise.
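A schematic sketch of curriculum-staged fine-tuning in the spirit of ICFT; the stage ordering and the `train_step` callback are illustrative assumptions, and the dual-stage memory coordination is omitted:

```python
# Sketch of curriculum-staged fine-tuning; stage names and the train_step
# hook are illustrative assumptions, not the paper's code.
from typing import Callable, Iterable

Stage = tuple[str, Iterable[str]]  # (name, examples), general -> specialized

def icft_schedule(model_state: dict,
                  stages: list[Stage],
                  train_step: Callable[[dict, str], dict]) -> dict:
    """Apply fine-tuning stages in order of increasing domain specificity,
    carrying the adapted parameters forward between stages."""
    for name, data in stages:
        print(f"stage: {name}")
        for example in data:
            model_state = train_step(model_state, example)  # e.g. a LoRA update
    return model_state

# Toy usage: three stages, from general text to clinical notes.
stages = [
    ("general language",   ["generic sentence 1", "generic sentence 2"]),
    ("biomedical text",    ["pubmed-style abstract"]),
    ("clinical expertise", ["de-identified clinical note"]),
]
final = icft_schedule({"steps": 0}, stages,
                      lambda s, ex: {"steps": s["steps"] + 1})
print(final)  # state after progressive general -> domain adaptation
```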
arXiv Detail & Related papers (2025-02-02T16:05:23Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
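A sketch of the kind of per-item zero-shot scoring prompt the abstract describes, using the standard ten MADRS items; the cue wording and score parsing are assumptions, not the LlaMADRS prompts:

```python
# Sketch of per-item zero-shot MADRS scoring; prompt wording and parsing
# are assumptions, not the LlaMADRS implementation.
import re

MADRS_ITEMS = ["apparent sadness", "reported sadness", "inner tension",
               "reduced sleep", "reduced appetite", "concentration difficulties",
               "lassitude", "inability to feel", "pessimistic thoughts",
               "suicidal thoughts"]  # standard MADRS items, each rated 0-6

def build_prompt(transcript: str, item: str) -> str:
    return ("You are rating a clinical interview on the MADRS scale.\n"
            f"Item: {item} (0 = absent, 6 = severe).\n"
            "Quote the evidence, then answer 'Score: <0-6>'.\n\n"
            f"Transcript:\n{transcript}")

def parse_score(completion: str) -> int | None:
    m = re.search(r"Score:\s*([0-6])", completion)
    return int(m.group(1)) if m else None

def assess(transcript: str, llm) -> int:
    """Sum per-item zero-shot scores into a MADRS total (0-60)."""
    return sum(parse_score(llm(build_prompt(transcript, item))) or 0
               for item in MADRS_ITEMS)

# Toy stand-in for an open-source LLM call:
print(assess("I barely sleep and nothing feels enjoyable anymore.",
             lambda p: "Evidence: 'barely sleep'. Score: 4"))
```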
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- From Explainable to Interpretable Deep Learning for Natural Language Processing in Healthcare: How Far from Reality? [8.423877102146433]
"eXplainable and Interpretable Artificial Intelligence" (XIAI) is introduced to distinguish XAI from IAI.
Our analysis shows that attention mechanisms are the most prevalent emerging IAI technique.
The major challenges identified are that most XIAI work does not explore "global" modelling processes, and that best practices, systematic evaluation, and benchmarks are lacking.
arXiv Detail & Related papers (2024-03-18T15:53:33Z)
- Validating polyp and instrument segmentation methods in colonoscopy through Medico 2020 and MedAI 2021 Challenges [58.32937972322058]
"Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image (MedAI 2021)" competitions.
We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translations of such methods into the clinic.
arXiv Detail & Related papers (2023-07-30T16:08:45Z)