Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models
- URL: http://arxiv.org/abs/2602.04355v1
- Date: Wed, 04 Feb 2026 09:25:07 GMT
- Title: Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models
- Authors: Sichu Liang, Hongyu Zhu, Wenwen Wang, Deyu Zhou
- Abstract summary: Working memory is a central component of intelligent behavior. Recent work has used n-back tasks to probe working-memory-like behavior in large language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids.
- Score: 24.58621679734274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
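As an illustration of the evaluation described in the abstract, the sketch below is not the authors' code; it is a minimal stand-in that assumes uniformly sampled grid positions, the standard signal-detection definition d' = Z(hit rate) - Z(false-alarm rate) with a log-linear correction, and a hypothetical lag-1 "recency-locked" responder. It generates a spatial n-back stream, scores that responder against the instructed 2-back targets, and reports how grid size changes the lag-1 repeat rate that the abstract ties to interference.

```python
import random
from statistics import NormalDist

def make_stream(grid_size: int, length: int, seed: int = 0):
    """Sample a sequence of cell positions on a grid_size x grid_size grid."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    return [rng.choice(cells) for _ in range(length)]

def n_back_targets(stream, n: int):
    """True where the current position matches the one n steps earlier."""
    return [i >= n and stream[i] == stream[i - n] for i in range(len(stream))]

def d_prime(hits, false_alarms, n_targets, n_lures):
    """d' = Z(hit rate) - Z(false-alarm rate), with a log-linear
    correction so rates of 0 or 1 do not produce infinite z-scores."""
    hr = (hits + 0.5) / (n_targets + 1)
    far = (false_alarms + 0.5) / (n_lures + 1)
    z = NormalDist().inv_cdf
    return z(hr) - z(far)

def recent_repeat_rate(stream, lag: int = 1):
    """Fraction of trials repeating the position `lag` steps back
    (a proxy for the recent-repeat structure discussed in the abstract)."""
    matches = sum(stream[i] == stream[i - lag] for i in range(lag, len(stream)))
    return matches / (len(stream) - lag)

if __name__ == "__main__":
    stream = make_stream(grid_size=3, length=60)
    targets = n_back_targets(stream, n=2)
    # Hypothetical recency-locked responder: answers "match" whenever the
    # current cell equals the immediately preceding one (lag 1, not lag 2).
    responses = [i >= 1 and stream[i] == stream[i - 1] for i in range(len(stream))]
    hits = sum(r and t for r, t in zip(responses, targets))
    fas = sum(r and not t for r, t in zip(responses, targets))
    n_t = sum(targets)
    print("2-back d' of a recency-locked responder:",
          round(d_prime(hits, fas, n_t, len(targets) - n_t), 2))
    for g in (3, 4, 5):
        s = make_stream(grid_size=g, length=2000, seed=1)
        print(f"{g}x{g} grid, lag-1 repeat rate:", round(recent_repeat_rate(s), 3))
```

Under these assumptions, a denser grid yields more chance repeats at short lags, so a recency-locked responder accumulates more false alarms on 2-back lures, which is one way the grid-size effect on interference described in the abstract could manifest.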
Related papers
- Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions [62.02112656288921]
Reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. We learn a compact latent action space for RL fine-tuning instead. We leverage both paired image-text data and text-only data to construct the latent action space.
arXiv Detail & Related papers (2026-01-12T13:13:24Z) - JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation [22.956416709470503]
Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. We propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations.
arXiv Detail & Related papers (2025-09-26T16:29:37Z) - Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs [49.42020616826156]
Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs, yet demonstrate higher accuracies when performing an analogous task on text. We investigate this accuracy gap by identifying and comparing the circuits in different modalities. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers.
arXiv Detail & Related papers (2025-06-10T17:59:21Z) - Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks. They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison. We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z) - A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection.
There is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task.
We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap.
arXiv Detail & Related papers (2024-10-01T08:16:40Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memorized with the weights of FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval [9.899703354116962]
Dense video captioning aims to automatically localize and caption all events within untrimmed video.
We propose a novel framework inspired by the cognitive information processing of humans.
Our model utilizes external memory to incorporate prior knowledge.
arXiv Detail & Related papers (2024-04-11T09:58:23Z) - Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z) - On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Memory-Based Semantic Parsing [79.48882899104997]
We present a memory-based model for context-dependent semantic parsing.
We learn a context memory controller that manages the memory by maintaining the cumulative meaning of sequential user utterances.
arXiv Detail & Related papers (2021-09-07T16:15:13Z)