ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
- URL: http://arxiv.org/abs/2505.04192v2
- Date: Mon, 13 Oct 2025 15:50:05 GMT
- Title: ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
- Authors: Trinh T. L. Vuong, Jin Tae Kwak,
- Abstract summary: We present ViDRiP-LLaVA, the first large multimodal model (LMM) in computational pathology.<n>It integrates three distinct image scenarios, including single patch images, automatically segmented pathology video clips, and manually segmented pathology videos.<n>By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA bridges visual narratives with diagnostic reasoning.
- Score: 2.832420256346882
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present ViDRiP-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, including single patch images, automatically segmented pathology video clips, and manually segmented pathology videos. This integration closely mirrors the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, ViDRiP-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the ViDRiP-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. ViDRiP-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at: https://github.com/QuIIL/ViDRiP-LLaVA.
Related papers
- MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling [51.31633278218137]
MedScope is a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures.<n>We build ClinVideoSuite, an evidence-centric, fine-grained clinical video suite.<n>On full and fine-grained video understanding benchmarks, MedScope achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-11T09:47:02Z) - HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs [11.582033617887573]
We propose textbfHyperWalker, a framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training.<n> Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-01-20T12:48:09Z) - A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis.<n>CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy.<n>This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks.<n>RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis [9.728322291979564]
We propose PathMR, a cell-level Multimodal visual Reasoning framework for Pathological image analysis.<n>We show that PathMR consistently outperforms state-of-the-art visual reasoning methods in text generation quality, segmentation accuracy, and cross-modal alignment.
arXiv Detail & Related papers (2025-08-28T14:46:24Z) - MedVQA-TREE: A Multimodal Reasoning and Retrieval Framework for Sarcopenia Prediction [1.7775777785480917]
We propose MedVQA-TREE, a framework that integrates a hierarchical image interpretation module, a gated feature-level fusion mechanism, and a novel multi-hop, multi-query retrieval strategy.<n>A gated fusion mechanism selectively integrates visual features with textual queries, while clinical knowledge is retrieved through a UMLS-guided pipeline accessing PubMed and a sarcopenia-specific external knowledge base.<n>The model achieved up to 99% diagnostic accuracy and outperformed previous state-of-the-art methods by over 10%.
arXiv Detail & Related papers (2025-08-26T13:31:01Z) - DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning [58.70446237944036]
DAVID-X is the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales.<n>We present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning.<n>Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
arXiv Detail & Related papers (2025-06-13T13:39:53Z) - VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection [91.97935118185]
We propose Video-to-Image knowledge distillation for the task of medical video lesion detection.
By distilling multi-frame contexts into a single frame, the proposed V2I-DETR combines the advantages of utilizing temporal contexts from video-based models and the inference speed of image-based models.
V2I-DETR outperforms previous state-of-the-art methods by a large margin while achieving the real-time inference speed (30 FPS) as the image-based model.
arXiv Detail & Related papers (2024-08-26T07:17:05Z) - Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z) - Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos [11.913023311613884]
We introduce Quilt-Instruct, a large-scale dataset of histopathology-specific instruction question/answer pairs.<n>Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch.
arXiv Detail & Related papers (2023-12-07T23:16:37Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - OnUVS: Online Feature Decoupling Framework for High-Fidelity Ultrasound
Video Synthesis [34.07625938756013]
Sonographers must observe corresponding dynamic anatomic structures to gather comprehensive information.
The synthesis of US videos may represent a promising solution to this issue.
We present a novel online feature-decoupling framework called OnUVS for high-fidelity US video synthesis.
arXiv Detail & Related papers (2023-08-16T10:16:50Z) - Pixel-Level Explanation of Multiple Instance Learning Models in
Biomedical Single Cell Images [52.527733226555206]
We investigate the use of four attribution methods to explain a multiple instance learning models.
We study two datasets of acute myeloid leukemia with over 100 000 single cell images.
We compare attribution maps with the annotations of a medical expert to see how the model's decision-making differs from the human standard.
arXiv Detail & Related papers (2023-03-15T14:00:11Z) - PS-DeVCEM: Pathology-sensitive deep learning model for video capsule
endoscopy based on weakly labeled data [0.0]
We propose a pathology-sensitive deep learning model (PS-DeVCEM) for frame-level anomaly detection and multi-label classification of different colon diseases in video capsule endoscopy (VCE) data.
Our model is driven by attention-based deep multiple instance learning and is trained end-to-end on weakly labeled data.
We show our model's ability to temporally localize frames with pathologies, without frame annotation information during training.
arXiv Detail & Related papers (2020-11-22T15:33:37Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - Ultrasound Video Summarization using Deep Reinforcement Learning [12.320114045092291]
We introduce a fully automatic video summarization method tailored to the needs of medical video data.
We show that our method is superior to alternative video summarization methods and that it preserves essential information required by clinical diagnostic standards.
arXiv Detail & Related papers (2020-05-19T15:44:18Z) - Multi-Modal Video Forensic Platform for Investigating Post-Terrorist
Attack Scenarios [55.82693757287532]
Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence.
We present a video analytic platform that integrates visual and audio analytic modules and fuses information from surveillance cameras and video uploads from eyewitnesses.
arXiv Detail & Related papers (2020-04-02T14:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.