Augmenting Continual Learning of Diseases with LLM-Generated Visual Concepts
- URL: http://arxiv.org/abs/2508.03094v1
- Date: Tue, 05 Aug 2025 05:15:54 GMT
- Title: Augmenting Continual Learning of Diseases with LLM-Generated Visual Concepts
- Authors: Jiantao Tan, Peixian Ma, Kanghao Chen, Zhiming Dai, Ruixuan Wang,
- Abstract summary: We propose a novel framework that harnesses visual concepts generated by large language models (LLMs) as discriminative semantic guidance. Our method dynamically constructs a visual concept pool with a similarity-based filtering mechanism to prevent redundancy. Through attention, the module can leverage the semantic knowledge from relevant visual concepts and produce class-representative fused features for classification.
- Score: 1.1883838320818292
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Continual learning is essential for medical image classification systems to adapt to dynamically evolving clinical environments. Integrating multimodal information can significantly enhance continual learning of image classes. However, while existing approaches do utilize textual modality information, they rely solely on simplistic templates containing only a class name, thereby neglecting richer semantic information. To address these limitations, we propose a novel framework that harnesses visual concepts generated by large language models (LLMs) as discriminative semantic guidance. Our method dynamically constructs a visual concept pool with a similarity-based filtering mechanism to prevent redundancy. Then, to integrate the concepts into the continual learning process, we employ a cross-modal image-concept attention module, coupled with an attention loss. Through attention, the module can leverage the semantic knowledge from relevant visual concepts and produce class-representative fused features for classification. Experiments on medical and natural image datasets show that our method achieves state-of-the-art performance, demonstrating its effectiveness and superiority. We will release the code publicly.
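The two mechanisms named in the abstract — similarity-based filtering of the concept pool and image-concept attention producing a fused feature — can be sketched roughly as below. This is a minimal NumPy sketch under my own assumptions (the function names, the cosine-similarity threshold, and the single-query softmax attention shape are illustrative, not the authors' released code).

```python
import numpy as np

def maybe_add_concept(pool, candidate, threshold=0.85):
    """Similarity-based filtering: add a candidate concept embedding to the
    pool only if it is not too close to any existing entry (cosine sim)."""
    c = candidate / np.linalg.norm(candidate)
    for existing in pool:
        e = existing / np.linalg.norm(existing)
        if float(c @ e) >= threshold:   # redundant concept: reject
            return False
    pool.append(candidate)
    return True

def fuse_image_with_concepts(image_feat, concept_feats, temperature=1.0):
    """Cross-modal image-concept attention: the image feature acts as the
    query over concept embeddings; the output is an attention-weighted
    mixture of concepts (the 'fused feature' used for classification)."""
    scores = concept_feats @ image_feat / temperature   # (K,) similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over concepts
    fused = weights @ concept_feats                     # (d,) fused feature
    return fused, weights
```

In this sketch a near-duplicate embedding is rejected while an orthogonal one is kept; in the paper the fused feature would additionally be supervised by the attention loss, which this toy version omits.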
Related papers
- Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning [58.73625654718187]
Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. Existing approaches fine-tune the visual backbone by seen-class data to obtain semantic-related visual features. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation.
arXiv Detail & Related papers (2025-03-29T10:17:57Z) - MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations [13.991376926757036]
We propose MedUnifier, a unified Vision-Language Pre-Training framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies. Our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality.
arXiv Detail & Related papers (2025-03-02T21:09:32Z) - COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training. It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z) - Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework. Our approach organizes various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z) - A Classifier-Free Incremental Learning Framework for Scalable Medical Image Segmentation [6.591403935303867]
We introduce a novel segmentation paradigm enabling the segmentation of a variable number of classes within a single classifier-free network.
This network is trained using contrastive learning and produces discriminative feature representations that facilitate straightforward interpretation.
We demonstrate the flexibility of our method in handling varying class numbers within a unified network and its capacity for incremental learning.
arXiv Detail & Related papers (2024-05-25T19:05:07Z) - MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
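The global image-text contrastive objective that MLIP's framework builds on is typically an InfoNCE-style loss; the symmetric two-direction form below is a generic sketch of that standard technique, not MLIP's actual implementation, and the temperature value is an assumption.

```python
import numpy as np

def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs (row i with row i) are
    pulled together; all other pairs in the batch are pushed apart."""
    # L2-normalize so dot products are cosine similarities
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(img))                # correct match is the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly matched pairs the diagonal dominates and the loss is small; shuffling the text rows moves the true matches off the diagonal and the loss rises.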
arXiv Detail & Related papers (2024-02-03T05:48:50Z) - Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training [6.582001681307021]
We propose the Knowledge-Boosting Contrastive Vision-Language Pre-training framework (KoBo)
KoBo integrates clinical knowledge into the learning of vision-language semantic consistency.
Experiments validate the effect of our framework on eight tasks including classification, segmentation, retrieval, and semantic relatedness.
arXiv Detail & Related papers (2023-07-14T09:38:22Z) - Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z) - SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.