Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities
- URL: http://arxiv.org/abs/2506.19320v1
- Date: Tue, 24 Jun 2025 05:18:31 GMT
- Title: Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities
- Authors: Yuang Yao, Ruiqi Wu, Yi Zhou, Tao Zhou,
- Abstract summary: RetCoP is the first continual vision-language pre-training framework in the fundus domain. It incrementally integrates image and text features from different imaging modalities into a single unified foundation model. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate.
- Score: 10.031534249264807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional fundus image analysis models focus on single-modal tasks, ignoring fundus modality complementarity, which limits their versatility. Recently, retinal foundation models have emerged, but most still remain modality-specific. Integrating multiple fundus imaging modalities into a single foundation model is valuable. However, in dynamic environments, data from different modalities often arrive incrementally, necessitating continual pre-training. To address this, we propose RetCoP, the first continual vision-language pre-training framework in the fundus domain, which incrementally integrates image and text features from different imaging modalities into a single unified foundation model. To mitigate catastrophic forgetting in continual pre-training, we introduce a rehearsal strategy utilizing representative image-text pairs and an off-diagonal information distillation approach. The former allows the model to revisit knowledge from previous stages, while the latter explicitly preserves the alignment between image and text representations. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate. The code can be found at https://github.com/Yuang-Yao/RetCoP.
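To make the two continual-learning ingredients concrete, the sketch below shows, in PyTorch style, how a rehearsal buffer of representative image-text pairs and an off-diagonal information distillation term could be combined with a CLIP-style contrastive objective. This is a minimal illustration under assumed interfaces (the `model(images, texts)` call returning L2-normalized embeddings, the `off_diagonal_distillation_loss` name, and the unit loss weighting are all assumptions), not the released RetCoP implementation; see the linked repository for the authors' code.

```python
import torch
import torch.nn.functional as F

def off_diagonal(sim):
    """Return the off-diagonal entries of a square similarity matrix."""
    n = sim.size(0)
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    return sim[mask]

def off_diagonal_distillation_loss(img_new, txt_new, img_old, txt_old, tau=0.07):
    """Match the off-diagonal image-text similarity structure of the frozen
    previous-stage encoders, i.e. how non-matching pairs relate to each other."""
    sim_new = img_new @ txt_new.t() / tau   # (B, B), current model
    sim_old = img_old @ txt_old.t() / tau   # (B, B), frozen previous stage
    return F.mse_loss(off_diagonal(sim_new), off_diagonal(sim_old))

def training_step(model, old_model, batch, rehearsal_batch, tau=0.07):
    # Mix the current-modality batch with rehearsed pairs from earlier stages
    # (both assumed to be pre-tokenized tensors of matching shapes).
    images = torch.cat([batch["image"], rehearsal_batch["image"]])
    texts = torch.cat([batch["text"], rehearsal_batch["text"]])

    img_new, txt_new = model(images, texts)           # current encoders
    with torch.no_grad():
        img_old, txt_old = old_model(images, texts)   # frozen previous stage

    # Standard CLIP-style symmetric contrastive loss on the joint batch.
    logits = img_new @ txt_new.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    return contrastive + off_diagonal_distillation_loss(img_new, txt_new,
                                                        img_old, txt_old, tau)
```

The intuition for distilling only the off-diagonal entries is that the diagonal (matched pairs) is already driven by the contrastive loss on new data, while the off-diagonal structure encodes the cross-modal alignment learned in earlier stages, which is what catastrophic forgetting tends to erase.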
Related papers
- Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [87.23753533733046]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z) - TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models.
We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization.
Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z) - MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise [36.81785819064916]
MM-Retinal is a multi-modal dataset that encompasses high-quality image-text pairs collected from professional fundus diagram books.
We present a novel Knowledge-enhanced foundational pretraining model which incorporates Fundus Image-Text expertise, called KeepFIT.
Our proposed fundus foundation model achieves state-of-the-art performance across six unseen downstream tasks and holds excellent generalization ability in zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-05-20T05:23:56Z) - JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement [69.6035373784027]
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models.
Previous methods may neglect the importance of a sufficiently formulated task-specific conditioning strategy.
We propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition.
arXiv Detail & Related papers (2023-12-20T08:05:57Z) - Self-Enhancement Improves Text-Image Retrieval in Foundation Visual-Language Models [33.008325765051865]
Cross-modal foundation models fail to focus on the key attributes required for domain-specific retrieval tasks.
We propose a self-enhancement framework, A3R, based on CLIP-ViT/G-14, one of the largest cross-modal models.
arXiv Detail & Related papers (2023-06-11T14:25:38Z) - The Role of Data Curation in Image Captioning [26.61662352061468]
This paper contributes to this direction by actively curating difficult samples in datasets without increasing the total number of samples.
Experiments on the Flickr30K and COCO datasets with the BLIP and BEiT-3 models demonstrate that these curation methods do indeed yield improved image captioning models.
arXiv Detail & Related papers (2023-05-05T15:16:07Z) - Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z) - Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [7.4262579052708535]
We argue that this effect is a consequence of conflicting gradients during multimodal VAE training.
We show how to detect the sub-graphs in the computational graphs where gradients conflict.
We empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
arXiv Detail & Related papers (2022-06-09T13:29:25Z) - StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal image-to-image (I2I) translation.
We learn a latent embedding, jointly with the generator, that models the variability of the output domain.
Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z) - The Power of Triply Complementary Priors for Image Compressive Sensing [89.14144796591685]
We propose a joint low-rank deep (LRD) image model, which contains a pair of triply complementary priors.
We then propose a novel hybrid plug-and-play framework based on the LRD model for image CS.
To make the optimization tractable, a simple yet effective algorithm is proposed to solve the resulting hybrid plug-and-play image CS problem.
arXiv Detail & Related papers (2020-05-16T08:17:44Z)
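As a quick illustration of the plug-and-play idea behind the last entry above, here is a minimal, generic sketch of a PnP proximal-gradient iteration for image compressive sensing: a gradient step on the data-fidelity term followed by an off-the-shelf denoiser standing in for the image prior. The `A`, `At`, and `denoiser` callables are placeholders, and this generic textbook-style loop is not the hybrid framework or algorithm proposed in that paper.

```python
import torch

def pnp_cs_reconstruct(y, A, At, denoiser, x0, step=1.0, iters=50):
    """Generic plug-and-play proximal-gradient loop for compressive sensing.

    y        : measurement tensor
    A, At    : measurement operator and its adjoint (callables on tensors)
    denoiser : any image denoiser acting as the prior (callable)
    """
    x = torch.clone(x0)
    for _ in range(iters):
        grad = At(A(x) - y)            # gradient of 0.5 * ||A(x) - y||^2
        x = denoiser(x - step * grad)  # denoiser replaces the proximal step
    return x
```

In practice `denoiser` can be anything from total-variation denoising to a pretrained CNN, which is what makes plug-and-play schemes attractive for combining hand-crafted and learned priors.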
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.