HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models
- URL: http://arxiv.org/abs/2406.14098v2
- Date: Fri, 5 Jul 2024 01:56:29 GMT
- Title: HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models
- Authors: Xinrui Zhou, Yuhao Huang, Wufeng Xue, Haoran Dou, Jun Cheng, Han Zhou, Dong Ni
- Abstract summary: We propose a novel framework named HeartBeat towards controllable and high-fidelity ECHO video synthesis.
HeartBeat is a unified framework that perceives multimodal conditions simultaneously to guide controllable generation.
In this way, users can synthesize ECHO videos that conform to their mental imagery by combining multimodal control signals.
- Score: 14.280181445804226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Echocardiography (ECHO) video is widely used for cardiac examination. In clinical practice, this procedure heavily relies on operator experience, which requires years of training and may benefit from deep learning-based systems for enhanced accuracy and efficiency. However, acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development is clinically unrealistic, so controllable ECHO video synthesis is highly desirable. In this paper, we propose a novel diffusion-based framework named HeartBeat towards controllable and high-fidelity ECHO video synthesis. Our highlights are three-fold. First, HeartBeat serves as a unified framework that perceives multimodal conditions simultaneously to guide controllable generation. Second, we factorize the multimodal conditions into local and global ones, with two insertion strategies that separately provide fine- and coarse-grained control in a composable and flexible manner. In this way, users can synthesize ECHO videos that conform to their mental imagery by combining multimodal control signals. Third, we propose to decouple visual-concept and temporal-dynamics learning using a two-stage training scheme that simplifies model training. Notably, HeartBeat generalizes easily to mask-guided cardiac MRI synthesis with only a few shots, showcasing its scalability to broader applications. Extensive experiments on two public datasets show the efficacy of the proposed HeartBeat.
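To make the second highlight concrete, here is a minimal PyTorch-style sketch of how the two insertion routes could be wired into a single denoising block: spatially aligned local conditions (e.g., masks or sketches) are concatenated channel-wise with the noisy latent for fine-grained control, while global condition tokens (e.g., view class or text embeddings) are injected via cross-attention for coarse-grained control. All class, argument, and shape choices below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultimodalConditionedBlock(nn.Module):
    """Illustrative denoising block with the two condition-insertion routes
    sketched from the abstract: local conditions are concatenated with the
    noisy latent (fine-grained control); global conditions steer the block
    via cross-attention (coarse-grained control). Names are hypothetical."""

    def __init__(self, latent_ch=4, local_ch=3, embed_dim=256, n_heads=4):
        super().__init__()
        # Local route: fuse spatially aligned conditions by concatenation.
        self.local_fuse = nn.Conv2d(latent_ch + local_ch, embed_dim, 3, padding=1)
        # Global route: attend to condition token embeddings.
        self.cross_attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.out = nn.Conv2d(embed_dim, latent_ch, 3, padding=1)

    def forward(self, z_t, local_cond, global_cond):
        # z_t: noisy latent (B, C, H, W); local_cond: (B, C_l, H, W);
        # global_cond: (B, N_tokens, embed_dim).
        h = self.local_fuse(torch.cat([z_t, local_cond], dim=1))
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)              # (B, H*W, C)
        attn, _ = self.cross_attn(self.norm(tokens), global_cond, global_cond)
        tokens = tokens + attn                             # residual injection
        return self.out(tokens.transpose(1, 2).reshape(b, c, hh, ww))

# Zeroing out either condition keeps the controls composable, matching the
# abstract's claim that multimodal control signals can be combined flexibly.
block = MultimodalConditionedBlock()
z = torch.randn(2, 4, 28, 28)                 # noisy latent
mask = torch.randn(2, 3, 28, 28)              # local condition (e.g., mask)
text = torch.randn(2, 8, 256)                 # global condition tokens
print(block(z, mask, text).shape)             # torch.Size([2, 4, 28, 28])
```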
Related papers
- EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation [1.0840985826142429]
We introduce EchoPrime, a multi-view, view-informed, video-based vision-language foundation model trained on over 12 million video-report pairs.
With retrieval-augmented interpretation, EchoPrime integrates information from all echocardiogram videos in a comprehensive study.
In datasets from two independent healthcare systems, EchoPrime achieves state-of-the-art performance on 23 diverse benchmarks of cardiac form and function.
arXiv Detail & Related papers (2024-10-13T03:04:22Z) - ECHOPulse: ECG controlled echocardio-grams video generation [30.753399869167588]
Echocardiography (ECHO) is essential for cardiac assessments.
ECHO video generation offers a solution by improving automated monitoring.
ECHOPULSE is an ECG-conditioned ECHO video generation model.
arXiv Detail & Related papers (2024-10-04T04:49:56Z) - PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z) - Explainable and Controllable Motion Curve Guided Cardiac Ultrasound Video Generation [11.879436948659691]
We propose an explainable and controllable method for echocardiography video generation.
First, we extract motion information from each heart substructure to construct motion curves.
Second, we propose the structure-to-motion alignment module, which can map semantic features onto motion curves.
Third, a position-aware attention mechanism is designed to enhance video consistency by utilizing Gaussian masks with structural position information.
arXiv Detail & Related papers (2024-07-31T09:59:20Z) - NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation [55.51412454263856]
This paper proposes to directly modulate the generation process of diffusion models using fMRI signals.
By training with about 67,000 fMRI-image pairs from various individuals, our model enjoys superior fMRI-to-image decoding capacity.
arXiv Detail & Related papers (2024-03-27T02:42:52Z) - Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z) - Weakly-supervised High-fidelity Ultrasound Video Synthesis with Feature Decoupling [13.161739586288704]
In clinical practice, analysis and diagnosis often rely on US sequences rather than a single image to obtain dynamic anatomical information.
This is challenging for novices to learn because practicing with adequate patient videos is clinically impractical.
We propose a novel framework to synthesize high-fidelity US videos.
arXiv Detail & Related papers (2022-07-01T14:53:22Z) - i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives, including masked modality unit modeling and cross-modality contrastive learning (a generic sketch of such a contrastive objective appears after this list).
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z) - One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z) - Echo-SyncNet: Self-supervised Cardiac View Synchronization in Echocardiography [11.407910072022018]
We propose Echo-SyncNet, a self-supervised learning framework to synchronize various cross-sectional 2D echo series without any external input.
We show promising results for synchronizing Apical 2 chamber and Apical 4 chamber cardiac views.
We also show the usefulness of the learned representations in a one-shot learning scenario of cardiac detection.
arXiv Detail & Related papers (2021-02-03T20:48:16Z) - Unpaired Multi-modal Segmentation via Knowledge Distillation [77.39798870702174]
We propose a novel learning scheme for unpaired cross-modality image segmentation.
In our method, we heavily reuse network parameters, by sharing all convolutional kernels across CT and MRI.
We have extensively validated our approach on two multi-class segmentation problems.
arXiv Detail & Related papers (2020-01-06T20:03:17Z)
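As referenced in the i-Code entry above, cross-modality contrastive learning is commonly realized as a symmetric InfoNCE objective over paired embeddings. The sketch below is a generic version of that idea, assuming simple (B, D) embedding tensors; it is not i-Code's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modality_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Generic symmetric InfoNCE loss aligning paired embeddings from two
    modalities (illustrative only, not i-Code's actual objective)."""
    a = F.normalize(emb_a, dim=-1)            # (B, D) unit-norm embeddings
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matched pairs lie on the diagonal; score both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = cross_modality_contrastive_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss.item())
```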