Dual Latent Memory for Visual Multi-agent System
- URL: http://arxiv.org/abs/2602.00471v1
- Date: Sat, 31 Jan 2026 02:49:10 GMT
- Title: Dual Latent Memory for Visual Multi-agent System
- Authors: Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan
- Abstract summary: Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, yet increasing agent turns often degrades performance while exponentially inflating token costs. We propose L$^{2}$-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories.
- Score: 69.29799381195592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L$^{2}$-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering mechanism that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, model sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%. Code: https://github.com/YU-deep/L2-VMAS.
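The entropy-driven proactive triggering described in the abstract can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the function names, the default threshold, and the use of Shannon entropy over the next-token distribution are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_fetch_memory(probs, threshold=1.0):
    """Hypothetical trigger: access latent memory on demand only when the
    agent's predictive uncertainty exceeds a threshold, instead of passively
    transmitting information every turn."""
    return token_entropy(probs) > threshold
```

A uniform distribution over four tokens has entropy ln 4 ≈ 1.39, so it would trigger a memory fetch at the default threshold, while a near-deterministic prediction would not.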
Related papers
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, an Episodic Stream, and a Symbolic tier. Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z) - M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval [64.06936170117943]
M$^2$ is a training-free, memory-augmented framework designed to optimize context efficiency and decision-making. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory), which compresses verbose interaction history into concise state updates, with Insight Retrieval Augmentation (External Memory), which guides the agent with actionable guidelines retrieved from an offline insight bank.
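As a rough illustration of such a dual-tier design (the class, method names, and string-based summarization/retrieval below are hypothetical stand-ins; a real system would use an LLM for summarization and embedding search for retrieval):

```python
class DualMemory:
    """Toy dual-tier memory: an internal rolling summary plus an external
    bank of reusable insights. A sketch, not the M^2 implementation."""

    def __init__(self, insight_bank):
        self.summary = ""                  # internal: compressed trajectory state
        self.insight_bank = insight_bank   # external: offline guidelines

    def update(self, step_log, summarize):
        # Compress verbose interaction history into a concise state update.
        self.summary = summarize(self.summary, step_log)

    def retrieve(self, query):
        # Naive keyword match stands in for real insight retrieval.
        return [tip for tip in self.insight_bank if query.lower() in tip.lower()]
```

The key design point is that the verbose trajectory never re-enters the context window; only the compact summary and the few retrieved insights do.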
arXiv Detail & Related papers (2026-02-28T06:59:51Z) - LatentMem: Customizing Latent Memory for Multi-Agent Systems [44.59989123744384]
We propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts.
arXiv Detail & Related papers (2026-02-03T03:03:16Z) - FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory [4.608947574766633]
We propose FadeMem, a biologically-inspired agent memory architecture that incorporates active forgetting mechanisms mirroring human cognitive efficiency. Experiments on Multi-Session Chat, LoCoMo, and LTI-Bench demonstrate superior multi-hop reasoning and retrieval with a 45% storage reduction.
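A minimal sketch of forgetting via exponential strength decay (the half-life, floor, and API below are assumptions for illustration; FadeMem's actual mechanism is more involved):

```python
class FadingMemory:
    """Entries lose strength exponentially with age; weak ones are pruned."""

    def __init__(self, half_life=3600.0, floor=0.1):
        self.half_life = half_life  # seconds until an entry's strength halves
        self.floor = floor          # prune entries whose strength falls below this
        self.items = {}             # key -> (value, write_time)

    def put(self, key, value, now):
        self.items[key] = (value, now)

    def strength(self, key, now):
        _, written = self.items[key]
        return 0.5 ** ((now - written) / self.half_life)

    def prune(self, now):
        # Drop entries that have decayed below the floor.
        self.items = {k: v for k, v in self.items.items()
                      if self.strength(k, now) >= self.floor}
```

Pruning on this schedule is what yields the storage savings: stale entries are physically removed rather than merely deprioritized at retrieval time.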
arXiv Detail & Related papers (2026-01-26T16:12:54Z) - Agentic Learner with Grow-and-Refine Multimodal Semantic Memory [50.81667005063605]
ViLoMem is a dual-stream memory framework that constructs compact, schema-based memory. It encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences.
arXiv Detail & Related papers (2025-11-26T18:55:08Z) - ExplicitLM: Decoupling Knowledge from Parameters via Explicit Memory Banks [4.099810580680816]
Large language models suffer from knowledge staleness and lack of interpretability due to implicit knowledge storage. We propose ExplicitLM, a novel architecture featuring a million-scale external memory bank storing human-readable knowledge as token sequences.
arXiv Detail & Related papers (2025-11-03T13:53:19Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents [60.095739427926524]
Long videos, characterized by temporally sparse task-relevant information, pose significant reasoning challenges for AI systems. Inspired by human progressive visual cognition, we propose CogniGPT for efficient and reliable long video understanding.
arXiv Detail & Related papers (2025-09-29T15:42:55Z) - Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers [0.0]
Large language models (LLMs) aligned for safety often exhibit emergent deceptive behaviors. This paper introduces adversarial activation patching, a novel mechanistic interpretability framework. By sourcing activations from "deceptive" prompts, we simulate vulnerabilities and quantify deception rates.
arXiv Detail & Related papers (2025-07-12T21:29:49Z) - Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection [5.195291754828701]
Collaborative perception allows real-time inter-agent information exchange, but limited communication bandwidth in practical scenarios restricts the inter-agent data transmission volume. We propose Which2comm, a novel multi-agent 3D object detection framework leveraging object-level sparse features.
arXiv Detail & Related papers (2025-03-21T14:24:07Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.