Related papers: Towards General Continuous Memory for Vision-Language Models

URL: http://arxiv.org/abs/2505.17670v2
Date: Mon, 07 Jul 2025 20:01:47 GMT
Title: Towards General Continuous Memory for Vision-Language Models
Authors: Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang
Abstract summary: Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. They still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. We propose using continuous memory, a compact set of dense embeddings, to represent multimodal and multilingual knowledge. Our approach, CoMEM, utilizes the VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings.
Score: 39.95345066340921
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which, however, may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings to more effectively and efficiently represent multimodal and multilingual knowledge. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2% of the model's parameters and a small corpus of 15.6K self-synthesized samples. Our approach CoMEM utilizes VLM's original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.

Related papers

Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models [26.305881774348844]
Multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. In vision-language models, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features. We propose SparseCut, a general cross-modal fusion architecture for MLLMs.
arXiv Detail & Related papers (2026-01-31T04:15:42Z)
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents [76.76004970226485]
Long-term memory is a critical capability for multimodal large language model (MLLM) agents. Mem-Gallery is a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents.
arXiv Detail & Related papers (2026-01-07T02:03:13Z)
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems [71.32550994522738]
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during dialogues. MemLoRA is a novel memory system that integrates small Vision-Language Models. VLM-integrated MemLoRA-V shows massive improvements in caption-based approaches.
arXiv Detail & Related papers (2025-12-04T12:56:30Z)
BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion [6.8723394189831035]
Large language models pose challenges for deployment in resource-constrained environments. We propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding.
arXiv Detail & Related papers (2025-09-10T16:09:49Z)
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding [40.784423313750075]
Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios. We propose a novel positional encoding approach that employs variable increments for visual tokens, enabling more efficient management of long multimodal sequences. We show that the fine-tuned model achieves strong performance on both standard and long-context multimodal tasks.
arXiv Detail & Related papers (2024-12-12T18:59:46Z)
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion [40.56646959926701]
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders. We introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs.
arXiv Detail & Related papers (2024-12-02T09:02:28Z)
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models. Our framework surpasses existing methods by an average accuracy of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.