MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
- URL: http://arxiv.org/abs/2507.05675v1
- Date: Tue, 08 Jul 2025 04:58:36 GMT
- Title: MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos
- Authors: Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen, Yunjin Yang, Benyou Wang
- Abstract summary: We introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation.
It comprises over 55,000 curated clips spanning real-world medical scenarios.
Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models.
- Score: 16.86256309424395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen
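The linked repository documents the actual interface; as a rough illustration, here is a minimal, hypothetical sketch of what generation could look like if MedGen were served through a diffusers-style text-to-video pipeline. The model id, prompt, call signature, and output handling are assumptions for illustration, not the confirmed API.

```python
# Hypothetical sketch only: the checkpoint id and call signature are
# assumptions; consult https://github.com/FreedomIntelligence/MedGen
# for the real entry point.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "FreedomIntelligence/MedGen",  # hypothetical model id
    torch_dtype=torch.float16,
).to("cuda")

prompt = (
    "Laparoscopic view during cholecystectomy: graspers retract the "
    "gallbladder while an electrocautery hook dissects the cystic duct."
)
result = pipe(prompt, num_frames=49, num_inference_steps=50)

# Most diffusers text-to-video pipelines return batched frame lists.
export_to_video(result.frames[0], "medgen_sample.mp4", fps=8)
```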
Related papers
- MedGemma Technical Report [76.28035910388918]
We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B.
MedGemma demonstrates advanced medical understanding and reasoning on images and text.
We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP.
arXiv Detail & Related papers (2025-07-07T17:01:44Z)
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge.
We then introduce our medical-specialized MLLM: Lingshu.
Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
- UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks.
UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs.
We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations (a minimal scoring sketch follows this entry).
arXiv Detail & Related papers (2024-12-13T18:59:40Z)
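Since the UniMed-CLIP entry above centers on contrastive image-text training and zero-shot evaluation, here is a minimal sketch of the CLIP-style zero-shot scoring step it implies. Random tensors stand in for the encoders' outputs; only the similarity-and-softmax math is shown, and the temperature is a typical CLIP default rather than UniMed-CLIP's trained value.

```python
# CLIP-style zero-shot classification: cosine similarity between one
# image embedding and one text embedding per class, softmaxed into
# class probabilities. Random tensors stand in for encoder outputs.
import torch
import torch.nn.functional as F

num_classes, dim = 6, 512  # e.g., one prompt per imaging modality
image_emb = F.normalize(torch.randn(1, dim), dim=-1)
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)

logit_scale = 100.0  # exp of the learned temperature; typical CLIP value
probs = (logit_scale * image_emb @ text_emb.T).softmax(dim=-1)
print(probs)  # probability that the image matches each class prompt
```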
- Artificial Intelligence for Biomedical Video Generation [8.21248952391087]
The introduction of Sora-like models represents a pivotal breakthrough in video generation technologies.
Video generation technology has shown immense potential in applications such as medical concept explanation, disease simulation, and biomedical data augmentation.
arXiv Detail & Related papers (2024-11-12T08:05:58Z)
- Bora: Biomedical Generalist Video Generation Model [20.572771714879856]
This paper introduces Bora, the first model designed for text-guided biomedical video generation.
It is fine-tuned through model alignment and instruction tuning using a newly established medical video corpus.
Bora is capable of generating high-quality video data across four distinct biomedical domains.
arXiv Detail & Related papers (2024-07-12T03:00:25Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain (a toy illustration follows this entry).
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
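The subfigure-to-subcaption challenge mentioned in the entry above can be made concrete with a toy example. This is not the paper's method, just an illustration of the simplest case, where panel labels such as "(a)" delimit the subcaptions; real PubMed captions are far messier.

```python
# Toy illustration (not the paper's method): splitting a compound
# figure caption into subcaptions keyed by panel labels like "(a)",
# so each subfigure can be paired with its own text.
import re

caption = ("(a) Axial T1-weighted MRI showing a left frontal lesion. "
           "(b) Same lesion on T2-FLAIR with surrounding edema.")

# Split before each panel marker, keeping the label with its text.
parts = re.split(r"\s*(?=\([a-z]\))", caption.strip())
subcaptions = {p[1]: p[4:].strip() for p in parts if p}
print(subcaptions)  # {'a': 'Axial T1-weighted ...', 'b': 'Same lesion ...'}
```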
- Endora: Video Generation Models as Endoscopy Simulators [53.72175969751398]
This paper introduces Endora, an innovative approach to generating medical videos that simulate clinical endoscopy scenes.
We also pioneer the first public benchmark for endoscopy simulation with video generation models.
Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research.
arXiv Detail & Related papers (2024-03-17T00:51:59Z)
- Med-Flamingo: a Multimodal Medical Few-shot Learner [58.85676013818811]
We propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain.
Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks (a prompt-format sketch follows this entry).
We conduct the first human evaluation for generative medical VQA where physicians review the problems and blinded generations in an interactive app.
arXiv Detail & Related papers (2023-07-27T20:36:02Z)
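Med-Flamingo's few-shot usage rests on interleaved image-text prompts. As a rough illustration, the sketch below builds such a prompt string in the OpenFlamingo convention that Med-Flamingo builds on ("<image>" marks where an image is spliced in, "<|endofchunk|>" separates shots); Med-Flamingo's exact preprocessing may differ, and the medical examples are invented.

```python
# Building an interleaved few-shot VQA prompt in the OpenFlamingo style.
# Each "<image>" placeholder is paired with one image at encoding time;
# the question/answer pairs here are invented for illustration.
shots = [
    ("What imaging modality is shown?", "Chest X-ray."),
    ("Is there a pleural effusion?", "Yes, on the left."),
]
query = "What abnormality is visible?"

prompt = "".join(
    f"<image>Question: {q} Answer: {a}<|endofchunk|>" for q, a in shots
) + f"<image>Question: {query} Answer:"
print(prompt)
```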