MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise
- URL: http://arxiv.org/abs/2405.11793v1
- Date: Mon, 20 May 2024 05:23:56 GMT
- Title: MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise
- Authors: Ruiqi Wu, Chenran Zhang, Jianle Zhang, Yi Zhou, Tao Zhou, Huazhu Fu
- Abstract summary: MM-Retinal is a multi-modal dataset that encompasses high-quality image-text pairs collected from professional fundus diagram books.
We present a novel Knowledge-enhanced foundational pretraining model which incorporates Fundus Image-Text expertise, called KeepFIT.
Our proposed fundus foundation model achieves state-of-the-art performance across six unseen downstream tasks and holds excellent generalization ability in zero-shot and few-shot scenarios.
- Score: 36.81785819064916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current fundus image analysis models are predominantly built for specific tasks, relying on individual datasets. The learning process is usually based on a data-driven paradigm without prior knowledge, resulting in poor transferability and generalizability. To address this issue, we propose MM-Retinal, a multi-modal dataset that encompasses high-quality image-text pairs collected from professional fundus diagram books. Moreover, enabled by MM-Retinal, we present a novel Knowledge-enhanced foundational pretraining model that incorporates Fundus Image-Text expertise, called KeepFIT. It is designed with image similarity-guided text revision and a mixed training strategy to infuse expert knowledge. Our proposed fundus foundation model achieves state-of-the-art performance across six unseen downstream tasks and holds excellent generalization ability in zero-shot and few-shot scenarios. MM-Retinal and KeepFIT are available at https://github.com/lxirich/MM-Retinal.
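As a rough illustration of the image similarity-guided text revision described in the abstract (a minimal sketch, not the released KeepFIT code), the snippet below retrieves the most similar MM-Retinal expert image for each public-dataset image and blends the paired expert caption features into the public text features. The function name, tensor shapes, and blending rule are assumptions made for this sketch.

```python
import torch.nn.functional as F

def revise_text_features(img_emb, pub_txt_emb, expert_img_emb, expert_txt_emb, alpha=0.5):
    """
    img_emb:        (B, D) embeddings of public-dataset images
    pub_txt_emb:    (B, D) embeddings of their categorical captions
    expert_img_emb: (N, D) embeddings of MM-Retinal expert images
    expert_txt_emb: (N, D) embeddings of the paired expert descriptions
    """
    # retrieve, for each public image, the most similar expert image by cosine similarity
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(expert_img_emb, dim=-1).T   # (B, N)
    nearest = sim.argmax(dim=-1)                                                 # (B,)
    # blend the retrieved expert text features into the public caption features
    revised = (1 - alpha) * pub_txt_emb + alpha * expert_txt_emb[nearest]        # (B, D)
    return F.normalize(revised, dim=-1)
```

In a pretraining loop of this kind, the revised text features would presumably stand in for the original caption features in the image-text contrastive objective; how KeepFIT actually performs the revision is detailed in the paper and repository.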
Related papers
- Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities [10.031534249264807]
RetCoP is the first continual vision-language pre-training framework in the fundus domain.
It incrementally integrates image and text features from different imaging modalities into a single unified foundation model.
Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and the lowest forgetting rate.
arXiv Detail & Related papers (2025-06-24T05:18:31Z)
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset [140.1967962502411]
We introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features.
A sequential pretraining strategy for unified models, first training on image understanding and subsequently on image generation, offers practical advantages.
Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models.
arXiv Detail & Related papers (2025-05-14T17:11:07Z)
- Flux Already Knows -- Activating Subject-Driven Image Generation without Training [25.496237241889048]
We propose a zero-shot framework for subject-driven image generation using a vanilla Flux model.
We activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning.
arXiv Detail & Related papers (2025-04-12T20:41:53Z)
- MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining [34.10482034684238]
Vision-language pretraining has been investigated to generalize across diverse downstream tasks for fundus image analysis.
We introduce MM-Retinal V2, a high-quality image-text paired dataset.
We propose a novel fundus vision-language pretraining model, namely KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets.
arXiv Detail & Related papers (2025-01-27T05:49:06Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by text-to-image (T2I) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
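To make the inter-class translation idea concrete, here is a hedged sketch that partially re-generates an image toward a different class with an off-the-shelf img2img diffusion pipeline from the diffusers library. The checkpoint ID, prompt template, and strength value are illustrative assumptions; this is not the authors' Diff-Mix implementation, which has its own training and mixing details.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# load a standard img2img pipeline (placeholder checkpoint)
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def translate_to_class(image_path, target_class, strength=0.6):
    """Partially re-generate an image toward `target_class` to create a mixed sample."""
    src = Image.open(image_path).convert("RGB").resize((512, 512))
    out = pipe(prompt=f"a photo of a {target_class}", image=src, strength=strength)
    return out.images[0]   # PIL image usable as an augmented training sample
```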
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that for large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving strong few-shot results.
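A minimal sketch of such a data mix, assuming three iterators of pre-built batches and purely illustrative sampling weights (not MM1's published recipe):

```python
import random

def mixed_batches(caption_batches, interleaved_batches, text_batches,
                  weights=(0.45, 0.45, 0.10)):
    """Yield pre-training batches drawn from three corpora with fixed mixing weights."""
    sources = [caption_batches, interleaved_batches, text_batches]
    while True:
        source = random.choices(sources, weights=weights, k=1)[0]
        yield next(source)   # each argument is an (infinite) iterator of batches
```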
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning [35.39822183728463]
We present a novel Prompt-IML framework for detecting tampered images.
Humans tend to discern the authenticity of an image based on both semantic and high-frequency information.
Our model can achieve better performance on eight typical fake image datasets.
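One hedged way to read the semantic plus high-frequency cue is a two-branch feature extractor that pairs a semantic backbone with a fixed high-pass filter. The Laplacian kernel and the feature concatenation suggested below are assumptions for this sketch, not Prompt-IML's actual design.

```python
import torch
import torch.nn.functional as F

# fixed 3x3 Laplacian kernel used as a simple high-pass filter
LAPLACIAN = torch.tensor([[0., -1., 0.],
                          [-1., 4., -1.],
                          [0., -1., 0.]]).view(1, 1, 3, 3)

def high_freq_residual(x):
    """x: (B, 3, H, W) image batch -> per-channel high-frequency response."""
    kernel = LAPLACIAN.to(x.device, x.dtype).repeat(3, 1, 1, 1)   # depthwise kernels
    return F.conv2d(x, kernel, padding=1, groups=3)

# a downstream model could concatenate features from a semantic backbone applied to x
# with features from a second backbone applied to high_freq_residual(x)
```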
arXiv Detail & Related papers (2024-01-01T03:45:07Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A visual-text aggregation module based on Transformer is further designed to incorporate cross-modal-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
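A hedged sketch of a Transformer-based visual-text aggregation module, where per-frame visual features cross-attend to caption tokens and are pooled into a video-level vector; the layer sizes and pooling choice are assumptions, not CapFSAR's exact configuration.

```python
import torch.nn as nn

class VisualTextAggregator(nn.Module):
    """Cross-attend frame features over caption features and pool to a video-level vector."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, T, D) per-frame features; txt_tokens: (B, L, D) caption features
        fused, _ = self.cross_attn(vis_tokens, txt_tokens, txt_tokens)
        return self.norm(vis_tokens + fused).mean(dim=1)   # (B, D)
```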
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification [1.1470070927586016]
We design a self-attention-based fusion module that serves as a block in our ensemble trainable network.
It allows the network to simultaneously learn discriminant features of the image and text modalities throughout the training stage.
This is the first work to leverage a mutual learning approach along with a self-attention-based fusion module for document image classification.
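A minimal sketch of a self-attention-based fusion block in this spirit, jointly encoding concatenated image and text tokens; the dimensions and mean pooling are assumptions rather than the EAML configuration.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Jointly encode concatenated image and text tokens with one Transformer encoder layer."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, Ni, D), txt_tokens: (B, Nt, D)
        fused = self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))
        return fused.mean(dim=1)   # (B, D) joint representation for classification
```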
arXiv Detail & Related papers (2023-05-11T16:05:03Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
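A hedged sketch of the retrieval step, returning the top-k most similar knowledge passages for an embedded query (text or image); the encoder and corpus format are placeholders, not MoRe's actual modules.

```python
import torch.nn.functional as F

def retrieve(query_emb, corpus_emb, corpus_texts, k=3):
    """
    query_emb:    (D,)   embedding of the input text or image
    corpus_emb:   (N, D) embeddings of the knowledge corpus
    corpus_texts: list of N passages aligned with corpus_emb
    """
    sim = F.normalize(corpus_emb, dim=-1) @ F.normalize(query_emb, dim=0)   # (N,)
    top_idx = sim.topk(min(k, sim.numel())).indices.tolist()
    return [corpus_texts[i] for i in top_idx]   # passages appended as extra context
```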
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
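A minimal sketch of instance-wise distillation between an old and a new vision-language model, pushing the new model's image-text similarity distribution toward the frozen old model's; the temperature and exact loss form are assumptions for this sketch.

```python
import torch.nn.functional as F

def distill_loss(new_img, new_txt, old_img, old_txt, tau=0.07):
    """All inputs: (B, D) L2-normalized embeddings from the new / frozen old model."""
    new_logits = new_img @ new_txt.T / tau        # (B, B) image-to-text similarities
    old_logits = old_img @ old_txt.T / tau
    return F.kl_div(F.log_softmax(new_logits, dim=-1),
                    F.softmax(old_logits, dim=-1).detach(),
                    reduction="batchmean")
```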
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Learning Feature Disentanglement and Dynamic Fusion for Recaptured Image Forensic [7.820667552233989]
We explicitly redefine the image recapture forensics task as recognition of four recapture patterns, i.e., moiré recapture, edge recapture, artifact recapture, and other recapture.
We propose a novel Feature Disentanglement and Dynamic Fusion (FDDF) model to adaptively learn the most effective recapture representation covering the different recapture patterns.
To the best of our knowledge, we are the first to propose a general model and a general real-scene large-scale dataset for recaptured image forensics.
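A hedged sketch of dynamic fusion over disentangled pattern features, where a small gating network predicts per-sample weights over the four recapture branches and returns their weighted sum; the names and sizes are assumptions, not the FDDF model.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Weight four recapture-pattern branches per sample with a small gating network."""
    def __init__(self, dim=128, n_patterns=4):   # moire, edge, artifact, other
        super().__init__()
        self.gate = nn.Linear(dim * n_patterns, n_patterns)

    def forward(self, pattern_feats):
        # pattern_feats: list of n_patterns tensors, each (B, D)
        stacked = torch.stack(pattern_feats, dim=1)                       # (B, P, D)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)    # (B, P)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)               # (B, D)
```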
arXiv Detail & Related papers (2022-06-13T12:47:13Z)