Memory-Inspired Temporal Prompt Interaction for Text-Image
Classification
- URL: http://arxiv.org/abs/2401.14856v1
- Date: Fri, 26 Jan 2024 13:36:12 GMT
- Title: Memory-Inspired Temporal Prompt Interaction for Text-Image
Classification
- Authors: Xinyao Yu, Hao Sun, Ziwei Niu, Rui Qin, Zhenjia Bai, Yen-Wei Chen,
Lanfen Lin
- Abstract summary: We propose a novel prompt-based multimodal interaction strategy inspired by human memory, namely Memory-Inspired Temporal Prompt Interaction (MITP).
We utilize temporal prompts on intermediate layers to imitate the acquiring stage, leverage similarity-based prompt interaction to imitate memory consolidation, and employ a prompt generation strategy to imitate memory activation.
We achieve competitive results on several datasets with relatively small memory usage and 2.0M trainable parameters.
- Score: 13.449375069856684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, large-scale pre-trained multimodal models (LMMs)
have emerged to integrate the vision and language modalities, achieving
considerable success in various natural language processing and computer
vision tasks. The growing size of LMMs, however, results in a significant
computational cost when fine-tuning these models for downstream tasks. Hence,
prompt-based interaction strategies have been studied to align modalities more
efficiently. In this context, we propose a novel prompt-based multimodal
interaction strategy inspired by human memory, namely Memory-Inspired Temporal
Prompt Interaction (MITP). Our proposed method involves two stages, as in
human memory: the acquiring stage, and the consolidation and activation stage.
We utilize temporal prompts on intermediate layers to imitate the acquiring
stage, leverage similarity-based prompt interaction to imitate memory
consolidation, and employ a prompt generation strategy to imitate memory
activation. The main strength of our approach is that prompt vectors interact
on intermediate layers, enabling sufficient information exchange between
modalities with compressed trainable parameters and memory usage. We achieve
competitive results on several datasets with relatively small memory usage and
2.0M trainable parameters (about 1% of the pre-trained foundation model).
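To make the described interaction more concrete, the following is a minimal sketch of what a similarity-based prompt interaction between two frozen unimodal encoders could look like. It is an illustration under assumptions, not the authors' implementation: the module name `TemporalPromptInteraction`, the prompt count, the embedding dimension, and the residual update rule are inferred from the abstract alone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalPromptInteraction(nn.Module):
    """Minimal sketch of a similarity-based prompt interaction layer.

    Trainable prompt vectors are attached to an intermediate layer of each
    (frozen) unimodal encoder. At every forward pass the prompts of one
    modality are updated from the other modality's prompts, weighted by
    cosine similarity (a stand-in for the memory-consolidation stage), and
    a small generator produces the prompts passed to the next layer (a
    stand-in for the memory-activation stage).
    """

    def __init__(self, dim: int = 768, num_prompts: int = 8):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.image_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.generator = nn.Linear(dim, dim)

    def consolidate(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Cosine-similarity weights between target and source prompts.
        sim = F.cosine_similarity(tgt.unsqueeze(1), src.unsqueeze(0), dim=-1)
        weights = sim.softmax(dim=-1)   # shape: (num_tgt, num_src)
        return tgt + weights @ src      # residual cross-modal update

    def forward(self):
        text = self.consolidate(self.image_prompts, self.text_prompts)
        image = self.consolidate(self.text_prompts, self.image_prompts)
        return self.generator(text), self.generator(image)


if __name__ == "__main__":
    layer = TemporalPromptInteraction(dim=768, num_prompts=8)
    text_p, image_p = layer()
    print(text_p.shape, image_p.shape)  # torch.Size([8, 768]) twice
```

In a full pipeline, the generated prompt vectors would be prepended to the token sequences of the frozen text and image encoders at the next intermediate layer; only the interaction step itself is sketched here, and only the prompts and the generator would be trained.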
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Human-like Episodic Memory for Infinite Context LLMs [13.211261438927798]
Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts.
In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs.
EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement (a rough sketch of surprise-based segmentation appears after this list).
arXiv Detail & Related papers (2024-07-12T17:34:03Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Interactive Continual Learning: Fast and Slow Thinking [19.253164551254734]
This paper presents a novel Interactive Continual Learning framework, enabled by collaborative interactions among models of various sizes.
To improve memory retrieval in System1, we introduce the CL-vMF mechanism, based on the von Mises-Fisher (vMF) distribution.
Comprehensive evaluation of our proposed ICL demonstrates significant resistance to forgetting and superior performance relative to existing methods.
arXiv Detail & Related papers (2024-03-05T03:37:28Z) - UniMC: A Unified Framework for Long-Term Memory Conversation via
Relevance Representation Learning [15.313416157905685]
We propose a Unified framework for Long-term Memory Conversations (UniMC).
We decompose the main task into three subtasks based on probability graphs.
Each subtask involves learning a representation for calculating the relevance between the query and memory.
arXiv Detail & Related papers (2023-06-18T12:30:50Z) - Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z) - Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of
Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called Cross-Channel Attention Module (CCAM).
In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth called AffineMix, and a simple depth augmentation using predicted semantics called ColorAug.
Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantics.
arXiv Detail & Related papers (2022-06-21T17:40:55Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - Improving Meta-learning for Low-resource Text Classification and
Generation via Memory Imitation [87.98063273826702]
We propose a memory imitation meta-learning (MemIML) method that enhances the model's reliance on support sets for task adaptation.
A theoretical analysis is provided to prove the effectiveness of our method.
arXiv Detail & Related papers (2022-03-22T12:41:55Z) - METEOR: Learning Memory and Time Efficient Representations from
Multi-modal Data Streams [19.22829945777267]
We present METEOR, a novel MEmory and Time Efficient Online Representation learning technique.
We show that METEOR preserves the quality of the representations while reducing memory usage by around 80% compared to the conventional memory-intensive embeddings.
arXiv Detail & Related papers (2020-07-23T08:18:02Z)