MINT: Memory-Infused Prompt Tuning at Test-time for CLIP
- URL: http://arxiv.org/abs/2506.03190v1
- Date: Sat, 31 May 2025 07:31:20 GMT
- Title: MINT: Memory-Infused Prompt Tuning at Test-time for CLIP
- Authors: Jiaming Yi, Ruirui Pan, Jishen Yang, Xiulong Yang
- Abstract summary: Existing Test-Time Adaptation methods fall short in fully leveraging the model's internal knowledge. Inspired by human associative memory theory, MINT introduces a Memory Prompt Bank. MINT enables rapid, precise VLM adaptation at test time by leveraging this MPB-acquired memory.
- Score: 2.117421588033177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving the generalization ability of Vision-Language Pre-trained Models (VLMs) under test-time data distribution shifts remains a critical challenge. Existing Test-Time Adaptation (TTA) methods fall short in fully leveraging the model's internal knowledge, particularly in dynamically adapting to complex and hierarchical visual semantic information. In this paper, we propose Memory-Infused Prompt Tuning (MINT), a novel framework to address this issue. Inspired by human associative memory theory, MINT introduces a Memory Prompt Bank (MPB), which stores learnable key-value prompt pairs that serve as a memory of previously seen samples. At test time, relevant prompt pairs in the MPB are retrieved using the hierarchical visual features of test images to dynamically assemble Associative Prompts. These Associative Prompts are then injected into the image encoder for fine-grained, customized visual contextual guidance. MINT also utilizes learnable text prompts. MINT thus enables rapid, precise VLM adaptation at test time by leveraging this MPB-acquired memory, without source data or retraining. The code is available at https://github.com/Jamieyi2004/MINT.
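The abstract describes the MPB only at a high level. Below is a minimal sketch of the retrieval-and-assembly idea, not the authors' implementation: all names and shapes (MemoryPromptBank, num_pairs, top_k, a single pooled query in place of hierarchical features) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryPromptBank(nn.Module):
    """Learnable key-value prompt pairs acting as a memory of seen samples.

    Keys index the bank; values are prompt tokens blended into "Associative
    Prompts" according to their similarity to the test image's features.
    """

    def __init__(self, num_pairs=32, prompt_len=4, dim=768, top_k=4):
        super().__init__()
        self.keys = nn.Parameter(0.02 * torch.randn(num_pairs, dim))                 # (M, d)
        self.values = nn.Parameter(0.02 * torch.randn(num_pairs, prompt_len, dim))   # (M, P, d)
        self.top_k = top_k

    def forward(self, query):
        """query: (B, d) pooled visual feature -> associative prompts (B, P, d)."""
        sim = F.normalize(query, dim=-1) @ F.normalize(self.keys, dim=-1).T          # (B, M)
        top_sim, top_idx = sim.topk(self.top_k, dim=-1)                              # (B, k)
        weights = top_sim.softmax(dim=-1)                                            # (B, k)
        picked = self.values[top_idx]                                                # (B, k, P, d)
        return (weights[..., None, None] * picked).sum(dim=1)                        # (B, P, d)
```

In a full pipeline, the returned prompts would be prepended to the patch tokens inside CLIP's image encoder (one query per selected layer would give the hierarchical retrieval the abstract mentions), and the bank's parameters would be updated by the test-time objective.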
Related papers
- Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions [55.19217798774033]
Memory is a fundamental component of AI systems, underpinning large language model (LLM)-based agents. In this survey, we first categorize memory representations into parametric and contextual forms. We then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression (a toy encoding of these operations follows this entry).
arXiv Detail & Related papers (2025-05-01T17:31:33Z)
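As a toy rendering of the survey's taxonomy (the six operation names come from the abstract; the enum itself is invented for illustration):

```python
from enum import Enum, auto

class MemoryOp(Enum):
    """The survey's six fundamental memory operations (names from the abstract)."""
    CONSOLIDATION = auto()  # fold new experience into long-term storage
    UPDATING = auto()       # revise stored content in place
    INDEXING = auto()       # build keys/structures that make recall cheap
    FORGETTING = auto()     # drop or decay stale entries
    RETRIEVAL = auto()      # fetch entries relevant to the current query
    COMPRESSION = auto()    # summarize entries to save space
```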
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we combine the advantages of both paradigms in a test-time prompt tuning approach (a sketch of a typical tuning step follows this entry).
Our proposed method outperforms all comparison methods on average across both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
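The summary does not spell out the tuning objective. A common recipe in this line of work (e.g., TPT-style tuning) minimizes the entropy of predictions averaged over augmented views of the test image, updating only the prompt vectors; the sketch below assumes a hypothetical `model(prompt, views) -> logits` interface.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_step(model, prompt, views, optimizer):
    """One test-time prompt-tuning step on a single test image.

    `views` is a batch of augmented views of one image, shape (B, C, H, W).
    Only `prompt` requires grad and is registered with `optimizer`; the
    model(prompt, x) API is hypothetical, real CLIP wrappers differ.
    """
    logits = model(prompt, views)                          # (B, num_classes)
    probs = logits.softmax(dim=-1).mean(dim=0)             # average over views
    loss = -(probs * probs.clamp_min(1e-12).log()).sum()   # prediction entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```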
- PECTP: Parameter-Efficient Cross-Task Prompts for Incremental Vision Transformer [76.39111896665585]
Incremental Learning (IL) aims to continually train deep models on sequential tasks.
Recent large pre-trained models (PTMs) have achieved outstanding performance in practical IL through prompt techniques, without access to old samples.
arXiv Detail & Related papers (2024-07-04T10:37:58Z)
- Semantic Residual Prompts for Continual Learning [21.986800282078498]
We show that our method significantly outperforms both state-of-the-art CL approaches and zero-shot CLIP.
Our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model.
arXiv Detail & Related papers (2024-03-11T16:23:38Z)
- MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models [10.783764497590473]
Transformer-based language models (LMs) track contextual information through large, hard-coded input windows.
We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors.
Tested on a task designed to probe an LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history (a minimal sketch of the wrapper follows this entry).
arXiv Detail & Related papers (2024-02-23T11:30:39Z)
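A minimal sketch of the MemoryPrompt idea, assuming a GRU-based memory and invented shape parameters; the real system's sizes and training recipe differ:

```python
import torch
import torch.nn as nn

class MemoryPromptWrapper(nn.Module):
    """A small recurrent network reads each text chunk's representation and
    emits vectors that are prepended, as soft-prompt embeddings, to the
    frozen LM's next input. Shapes and names here are assumptions."""

    def __init__(self, lm_dim=768, mem_dim=128, n_prefix=4):
        super().__init__()
        self.rnn = nn.GRUCell(lm_dim, mem_dim)            # recurrent memory update
        self.to_prefix = nn.Linear(mem_dim, n_prefix * lm_dim)
        self.n_prefix, self.lm_dim = n_prefix, lm_dim

    def forward(self, chunk_repr, h):
        """chunk_repr: (B, lm_dim) summary of the latest chunk; h: (B, mem_dim)."""
        h = self.rnn(chunk_repr, h)                       # fold the chunk into memory
        prefix = self.to_prefix(h).view(-1, self.n_prefix, self.lm_dim)
        return prefix, h  # prepend `prefix` to the LM's next input embeddings
```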
- TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification [60.5843635938469]
We propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID.
More specifically, we extract the identity-specific sequence feature as the CLIP-Memory to replace the text feature (a toy sketch of such a memory follows this entry).
Our proposed method shows much better results than other state-of-the-art methods on MARS, LS-VID and iLIDS-VID.
arXiv Detail & Related papers (2023-12-15T09:10:05Z)
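The following toy "CLIP-Memory" illustrates replacing text-derived classifier weights with per-identity prototypes built from sequence features; the momentum update and shapes are assumptions, not TF-CLIP's exact scheme:

```python
import torch
import torch.nn.functional as F

class ClipMemory:
    """Per-identity prototypes built from video sequence features stand in
    for CLIP's text-based classifier weights (hence 'text-free')."""

    def __init__(self, num_ids, dim, momentum=0.9):
        self.proto = F.normalize(torch.randn(num_ids, dim), dim=-1)  # (IDs, d)
        self.m = momentum

    @torch.no_grad()
    def update(self, seq_feat, ids):
        """seq_feat: (B, d) pooled per-tracklet features; ids: (B,) labels."""
        for f, i in zip(F.normalize(seq_feat, dim=-1), ids.tolist()):
            self.proto[i] = F.normalize(self.m * self.proto[i] + (1 - self.m) * f, dim=0)

    def logits(self, seq_feat, temp=0.07):
        """Cosine similarity to the prototypes replaces image-text matching."""
        return F.normalize(seq_feat, dim=-1) @ self.proto.T / temp
```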
- Compound Text-Guided Prompt Tuning via Image-Adaptive Cues [42.248853198953945]
We propose Compound Text-Guided Prompt Tuning (TGP-T).
It significantly reduces resource demand while achieving superior performance.
It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet.
arXiv Detail & Related papers (2023-12-11T14:17:02Z)
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets (a toy triplet store follows this entry).
Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
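A toy read-write triplet memory in the spirit of RET-LLM's description; the exact-match lookup is a simplification of whatever retrieval the framework actually uses:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str
    relation: str
    obj: str

class TripletMemory:
    """Knowledge stored as (subject, relation, object) triplets with a
    write/read API; later writes supersede earlier ones, which gives the
    update behavior needed for temporal question answering."""

    def __init__(self):
        self._store: dict[tuple[str, str], str] = {}

    def write(self, t: Triplet) -> None:
        self._store[(t.subject, t.relation)] = t.obj

    def read(self, subject: str, relation: str) -> str | None:
        return self._store.get((subject, relation))

memory = TripletMemory()
memory.write(Triplet("Alice", "works_at", "Acme"))
memory.write(Triplet("Alice", "works_at", "Globex"))  # update supersedes
assert memory.read("Alice", "works_at") == "Globex"
```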
- Classification and Generation of real-world data with an Associative Memory Model [0.0]
We extend the capabilities of the basic Associative Memory Model by using a Multiple-Modality framework.
Storing both images and labels as modalities lets a single memory retrieve and complete patterns (see the completion sketch after this entry).
arXiv Detail & Related papers (2022-07-11T12:51:27Z)
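A generic associative-memory completion sketch (modern-Hopfield-style softmax recall over concatenated image+label patterns), illustrating multi-modal pattern completion rather than the paper's exact model:

```python
import numpy as np

def complete_pattern(stored, partial, mask, beta=8.0):
    """stored: (N, D) patterns, each an image vector concatenated with a
    one-hot label (the two 'modalities'); partial: (D,) query with unknown
    entries zeroed; mask: (D,) 1 where the query is known."""
    scores = beta * (stored * mask) @ partial   # match on known entries only
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ stored                     # weighted recall fills the gaps

# Example: two stored (image, label) patterns; recover the label from the image.
imgs = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
labels = np.eye(2)
stored = np.hstack([imgs, labels])              # (2, 5)
query = np.array([1.0, 0.0, 1.0, 0.0, 0.0])    # image known, label zeroed
mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
print(np.round(complete_pattern(stored, query, mask), 2))  # label part ~ [1, 0]
```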
- HM4: Hidden Markov Model with Memory Management for Visual Place Recognition [54.051025148533554]
We develop a Hidden Markov Model approach for visual place recognition in autonomous driving.
Our algorithm, dubbed HM$^4$, exploits temporal look-ahead to transfer promising candidate images between passive storage and active memory (a toy two-tier memory follows this entry).
We show that this allows constant time and space inference for a fixed coverage area.
arXiv Detail & Related papers (2020-11-01T08:49:24Z)
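A toy two-tier memory illustrating the passive-storage/active-memory transfer idea; the look-ahead policy and sizes are invented:

```python
class TwoTierMemory:
    """A small active set holds reference images for matching, and temporal
    look-ahead promotes the candidates likely to be needed next from passive
    (e.g., disk) storage. Bounding the active set keeps inference cost constant."""

    def __init__(self, passive: dict[int, object], active_size: int = 8):
        self.passive = passive                 # place_id -> stored descriptor
        self.active: dict[int, object] = {}
        self.active_size = active_size

    def lookahead(self, belief: dict[int, float], horizon: int = 3) -> list[int]:
        """Predict places reachable within `horizon` steps; here simply the
        ids following the current high-belief states along the route."""
        likely = {p + d for p, w in belief.items() if w > 0.1
                  for d in range(1, horizon + 1)}
        return [p for p in likely if p in self.passive]

    def refresh(self, belief: dict[int, float]) -> None:
        """Promote predicted places into active memory, evicting the rest."""
        keep = self.lookahead(belief)[: self.active_size]
        self.active = {p: self.passive[p] for p in keep}
```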