MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling
- URL: http://arxiv.org/abs/2602.13332v1
- Date: Wed, 11 Feb 2026 09:47:02 GMT
- Title: MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling
- Authors: Wenjie Li, Yujie Zhang, Haoran Sun, Xingqi He, Hongcheng Gao, Chenglong Ma, Ming Hu, Guankun Wang, Shiyi Yao, Renhao Yang, Hongliang Ren, Lei Wang, Junjun He, Yankai Jiang
- Abstract summary: MedScope is a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. We build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance.
- Score: 51.31633278218137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-form clinical videos are central to visual evidence-based decision-making, with growing importance for applications such as surgical robotics. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose MedScope, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite. We then optimize MedScope with Grounding-Aware Group Relative Policy Optimization (GA-GRPO), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach illuminates a path toward medical AI agents that can genuinely "think with videos" through tool-integrated reasoning. We will release our code, models, and data.
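The abstract describes the control flow only at a high level, but the loop it names (reason, call a temporal tool, verify the retrieved frames, answer) can be sketched concretely. The snippet below is a minimal illustration, not the authors' released implementation: `ask_model`, `sample_frames`, the `video` object, and the `Segment` type are all hypothetical helpers, and the two-rate coarse/fine sampling is an assumption about what "coarse-to-fine" means here.

```python
# Minimal sketch of coarse-to-fine, tool-integrated video reasoning as the
# abstract describes it. All helper names (ask_model, sample_frames, Segment)
# are hypothetical; MedScope's actual tool interface is not specified here.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float

def coarse_to_fine_answer(video, question, ask_model, sample_frames,
                          max_rounds=4, coarse_fps=0.2, fine_fps=2.0):
    """Iteratively locate, verify, and justify an answer with temporally
    targeted evidence instead of one-shot uniform frame sampling."""
    # Coarse pass: sparse frames over the whole procedure.
    context = sample_frames(video, Segment(0, video.duration), fps=coarse_fps)
    evidence = []
    for _ in range(max_rounds):
        # One reasoning step; may emit a tool call proposing a time window.
        step = ask_model(question, context, evidence)
        if step.tool_call is None:  # model is confident enough to answer
            return step.answer, evidence
        seg = step.tool_call.segment                       # proposed window
        frames = sample_frames(video, seg, fps=fine_fps)   # fine, targeted look
        if step.verify(frames):   # keep only observations that check out
            evidence.append((seg, frames))
        context = frames          # reason over the zoomed-in clip next round
    return ask_model(question, context, evidence, force_answer=True).answer, evidence
```

GA-GRPO is likewise described only in outline (grounding-aligned rewards, evidence-weighted advantages). One plausible reading, sketched below under stated assumptions, is that each rollout's reward mixes answer correctness with a temporal-IoU grounding term, and the standard GRPO group-relative advantage is then scaled by how well the rollout's cited evidence overlaps the annotated span. The weighting form and the `alpha` coefficient are illustrative, not the paper's.

```python
import numpy as np

def temporal_iou(pred, gold):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def ga_grpo_advantages(correct, pred_spans, gold_span, alpha=0.5, eps=1e-6):
    """Group-relative advantages with a grounding-aligned reward and
    evidence-weighted scaling (illustrative formulation, not the paper's)."""
    iou = np.array([temporal_iou(s, gold_span) for s in pred_spans])
    rewards = np.asarray(correct, dtype=float) + alpha * iou  # grounding-aligned reward
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)  # standard GRPO baseline
    return adv * (0.5 + 0.5 * iou)  # upweight rollouts grounded in the right span

# Example: 4 rollouts for one question with gold evidence at 120-150 s.
adv = ga_grpo_advantages(
    correct=[1, 1, 0, 0],
    pred_spans=[(118, 152), (10, 40), (118, 152), (300, 330)],
    gold_span=(120, 150),
)
```

In this reading, a rollout that answers correctly while citing the wrong span still gains less than one that is both correct and well grounded, which is how grounding-aligned rewards would directly reinforce tool use.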
Related papers
- A Very Big Video Reasoning Suite [155.70016888896927]
Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. The Very Big Video Reasoning (VBVR) dataset is an unprecedentedly large-scale resource spanning 200 curated reasoning tasks. VBVR-Bench is a verifiable evaluation framework that moves beyond model-based judging with rule-based, human-aligned scorers.
arXiv Detail & Related papers (2026-02-23T18:59:41Z)
- Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening [8.010714901985898]
Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Current screening methods are subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage. ScoliGait is a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing.
arXiv Detail & Related papers (2026-02-06T14:44:22Z)
- MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data [32.65971100171597]
We introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. We also present MedGround-35K, a novel multimodal medical dataset.
arXiv Detail & Related papers (2026-01-11T10:34:18Z)
- Video-BrowseComp: Benchmarking Agentic Video Research on Open Web [64.53060049124961]
Video-BrowseComp is a benchmark comprising 210 questions tailored for open-web agentic video reasoning. It enforces a mandatory dependency on temporal visual evidence, ensuring answers cannot be derived solely through text search. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
arXiv Detail & Related papers (2025-12-28T19:08:27Z)
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling [87.98096428508181]
LongVT is an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. We exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
arXiv Detail & Related papers (2025-11-25T19:22:48Z)
- EchoAgent: Guideline-Centric Reasoning Agent for Echocardiography Measurement and Interpretation [23.197431495208672]
EchoAgent is a framework that enables structured, interpretable automation for echocardiographic video analysis. It orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. It achieves accurate, interpretable results despite the added complexity of temporal video analysis.
arXiv Detail & Related papers (2025-11-17T22:06:12Z)
- MedBrowseComp: Benchmarking Medical Deep Research and Computer Use [10.565661515629412]
MedBrowseComp is a benchmark that systematically tests an agent's ability to retrieve and synthesize medical facts. It contains more than 1,000 human-curated questions that mirror clinical scenarios. Applying MedBrowseComp to frontier agentic systems reveals substantial performance shortfalls, with scores as low as ten percent.
arXiv Detail & Related papers (2025-05-20T22:42:33Z)
- ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos [2.832420256346882]
We present ViDRiP-LLaVA, the first large multimodal model (LMM) in computational pathology. It integrates three distinct image scenarios: single patch images, automatically segmented pathology video clips, and manually segmented pathology videos. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA bridges visual narratives with diagnostic reasoning.
arXiv Detail & Related papers (2025-05-07T07:41:19Z)
- VITED: Video Temporal Evidence Distillation [49.38292490256531]
We investigate complex video question answering via chain-of-evidence reasoning. Existing models struggle with multi-step reasoning because they uniformly sample a fixed number of frames. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains.
arXiv Detail & Related papers (2025-03-17T06:30:02Z)
- Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)