Language-Inspired Relation Transfer for Few-shot Class-Incremental Learning
- URL: http://arxiv.org/abs/2501.05862v1
- Date: Fri, 10 Jan 2025 10:59:27 GMT
- Title: Language-Inspired Relation Transfer for Few-shot Class-Incremental Learning
- Authors: Yifan Zhao, Jia Li, Zeyin Song, Yonghong Tian
- Abstract summary: We propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions.
Our proposed LRT outperforms the state-of-the-art models by over $13\%$ and $7\%$ on the final session of the mini-ImageNet and CIFAR-100 FSCIL benchmarks, respectively.
- Score: 42.923762020491495
- Abstract: Depicting novel classes with language descriptions by observing few-shot samples is inherent in human-learning systems. This lifelong learning capability helps to distinguish new knowledge from old ones through the increase of open-world learning, namely Few-Shot Class-Incremental Learning (FSCIL). Existing works to solve this problem mainly rely on the careful tuning of visual encoders, which shows an evident trade-off between the base knowledge and incremental ones. Motivated by human learning systems, we propose a new Language-inspired Relation Transfer (LRT) paradigm to understand objects by joint visual clues and text depictions, composed of two major steps. We first transfer the pretrained text knowledge to the visual domains by proposing a graph relation transformation module and then fuse the visual and language embedding by a text-vision prototypical fusion module. Second, to mitigate the domain gap caused by visual finetuning, we propose context prompt learning for fast domain alignment and imagined contrastive learning to alleviate the insufficient text data during alignment. With collaborative learning of domain alignments and text-image transfer, our proposed LRT outperforms the state-of-the-art models by over $13\%$ and $7\%$ on the final session of mini-ImageNet and CIFAR-100 FSCIL benchmarks.
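The abstract stops at the module names, so the snippet below is only an illustrative PyTorch sketch of what a text-vision prototypical fusion step could look like: class-wise visual prototypes (mean features of the few support shots) gated together with text embeddings of the class descriptions. The gating design, dimensions, and names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVisionPrototypeFusion(nn.Module):
    """Hypothetical fusion of class-wise visual prototypes with text embeddings.

    Assumes the visual prototypes and the text embeddings of the class
    descriptions are already projected to the same dimension d.
    """
    def __init__(self, d: int):
        super().__init__()
        # Gate decides, per class and per channel, how much vision vs. text to keep.
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, visual_proto: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # visual_proto, text_emb: (num_classes, d)
        g = self.gate(torch.cat([visual_proto, text_emb], dim=-1))
        fused = g * visual_proto + (1.0 - g) * text_emb
        return F.normalize(fused, dim=-1)          # unit-norm class prototypes

# Toy usage: 5-way episode, 512-d embeddings, cosine classification of one query.
fusion = TextVisionPrototypeFusion(d=512)
protos = fusion(torch.randn(5, 512), torch.randn(5, 512))
query = F.normalize(torch.randn(1, 512), dim=-1)
logits = query @ protos.t()                        # cosine scores over the 5 classes
```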
Related papers
- Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation [70.95783968368124]
We introduce a novel multi-modal autoregressive model, dubbed $\textbf{InstaManip}$.
We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages.
Our method surpasses previous few-shot image manipulation models by a notable margin.
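The summary does not say how the two stages are realized; the snippet below is only one guess at a group self-attention split, where exemplar ("learning") tokens attend among themselves and query ("applying") tokens attend to everything. The function name and masking scheme are hypothetical, not taken from the paper.

```python
import torch

def two_stage_group_mask(n_exemplar: int, n_query: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for one self-attention pass.

    Hypothetical two-stage reading:
      * exemplar ("learning") tokens attend only among themselves,
      * query ("applying") tokens attend to exemplar tokens and to themselves.
    """
    n = n_exemplar + n_query
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_exemplar, :n_exemplar] = True            # learning stage
    mask[n_exemplar:, :] = True                      # applying stage
    return mask

mask = two_stage_group_mask(n_exemplar=4, n_query=2)
# For torch.nn.MultiheadAttention, pass attn_mask=~mask (True there means "blocked").
print(mask.int())
```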
arXiv Detail & Related papers (2024-12-02T01:19:21Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
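As a rough illustration (not iPromptDiff's actual code), visual context-modulated text guidance can be pictured as projecting a pooled vision-encoder feature into the text-conditioning sequence at a placeholder position; all names and dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class VisualContextModulator(nn.Module):
    """Hypothetical: map a visual-context feature onto one token of text conditioning."""
    def __init__(self, vis_dim: int = 768, txt_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)

    def forward(self, text_tokens: torch.Tensor, vis_feat: torch.Tensor,
                placeholder_idx: int) -> torch.Tensor:
        # text_tokens: (batch, seq_len, txt_dim) output of the text encoder
        # vis_feat:    (batch, vis_dim) pooled feature from a vision encoder
        modulated = text_tokens.clone()
        modulated[:, placeholder_idx] = self.proj(vis_feat)   # inject visual context
        return modulated                                       # feed to the diffusion U-Net

mod = VisualContextModulator()
out = mod(torch.randn(2, 77, 768), torch.randn(2, 768), placeholder_idx=1)
```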
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
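A minimal sketch of that idea, assuming word embeddings and region features share one dimension; the attention-to-mask mapping shown here is illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_mask_regions(word_emb: torch.Tensor, region_feat: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Down-weight the image regions most attended by each word.

    word_emb:    (num_words, d)   text-token embeddings
    region_feat: (num_regions, d) visual region features
    Returns one soft-masked copy of the region features per word.
    """
    attn = F.softmax(word_emb @ region_feat.t() / temperature, dim=-1)  # (W, R)
    soft_mask = 1.0 - attn                                              # suppress the relevant regions
    return soft_mask.unsqueeze(-1) * region_feat.unsqueeze(0)           # (W, R, d)

masked = soft_mask_regions(torch.randn(6, 256), torch.randn(36, 256))
# Each of the 6 masked views can serve as a harder example for image-text matching.
```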
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
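The game-theoretic formulation cannot be reconstructed from this summary; as a stand-in, the snippet below shows a generic fine-grained token-to-patch alignment score (FILIP-style max-over-patches late interaction), which only illustrates what fine-grained semantic alignment means in code, not LOUPE's own method.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(word_emb: torch.Tensor, patch_emb: torch.Tensor) -> torch.Tensor:
    """Token-level image-text similarity: each word is matched to its best patch.

    word_emb: (num_words, d), patch_emb: (num_patches, d)
    """
    w = F.normalize(word_emb, dim=-1)
    p = F.normalize(patch_emb, dim=-1)
    sim = w @ p.t()                      # (num_words, num_patches) cosine similarities
    return sim.max(dim=-1).values.mean() # average best-match score over the words

score = fine_grained_similarity(torch.randn(12, 512), torch.randn(196, 512))
```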
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
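A small sketch of the enrichment step using NLTK's WordNet interface (Wiktionary lookup omitted); the prompt wording is a placeholder of my own, not K-LITE's template.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def enrich_with_wordnet(class_name: str) -> str:
    """Append a WordNet gloss to a class name to build a knowledge-enriched prompt."""
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {class_name}."
    definition = synsets[0].definition()
    return f"a photo of a {class_name}, which is {definition}."

print(enrich_with_wordnet("sea lion"))
# e.g. "a photo of a sea lion, which is any of several large eared seals ..."
```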
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation [5.064384692591668]
This paper proposes LAViTeR, a novel architecture for visual and textual representation learning.
The main module, Visual Textual Alignment (VTA) will be assisted by two auxiliary tasks, GAN-based image synthesis and Image Captioning.
The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment.
arXiv Detail & Related papers (2021-09-04T22:48:46Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
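The objective described here is the standard symmetric contrastive (InfoNCE-style) loss over a batch of paired image and text embeddings; below is a minimal PyTorch version with batch-local negatives and a placeholder temperature.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-to-text / text-to-image contrastive loss.

    img_emb, txt_emb: (batch, d); row i of each tensor is an aligned pair,
    and every other row in the batch serves as a negative.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(32, 640), torch.randn(32, 640))
```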
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.