Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models
- URL: http://arxiv.org/abs/2602.04355v1
- Date: Wed, 04 Feb 2026 09:25:07 GMT
- Title: Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models
- Authors: Sichu Liang, Hongyu Zhu, Wenwen Wang, Deyu Zhou
- Abstract summary: Working memory is a central component of intelligent behavior. Recent work has used n-back tasks to probe working-memory-like behavior in large language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids.
- Score: 24.58621679734274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
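As an illustration of the evaluation described in the abstract, the sketch below is not the authors' code; it is a minimal stand-in that assumes uniformly sampled grid positions, the standard signal-detection definition d' = Z(hit rate) - Z(false-alarm rate) with a log-linear correction, and a hypothetical lag-1 "recency-locked" responder. It generates a spatial n-back stream, scores that responder against the instructed 2-back targets, and reports how grid size changes the lag-1 repeat rate that the abstract ties to interference.

```python
import random
from statistics import NormalDist

def make_stream(grid_size: int, length: int, seed: int = 0):
    """Sample a sequence of cell positions on a grid_size x grid_size grid."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    return [rng.choice(cells) for _ in range(length)]

def n_back_targets(stream, n: int):
    """True where the current position matches the one n steps earlier."""
    return [i >= n and stream[i] == stream[i - n] for i in range(len(stream))]

def d_prime(hits, false_alarms, n_targets, n_lures):
    """d' = Z(hit rate) - Z(false-alarm rate), with a log-linear
    correction so rates of 0 or 1 do not produce infinite z-scores."""
    hr = (hits + 0.5) / (n_targets + 1)
    far = (false_alarms + 0.5) / (n_lures + 1)
    z = NormalDist().inv_cdf
    return z(hr) - z(far)

def recent_repeat_rate(stream, lag: int = 1):
    """Fraction of trials repeating the position `lag` steps back
    (a proxy for the recent-repeat structure discussed in the abstract)."""
    matches = sum(stream[i] == stream[i - lag] for i in range(lag, len(stream)))
    return matches / (len(stream) - lag)

if __name__ == "__main__":
    stream = make_stream(grid_size=3, length=60)
    targets = n_back_targets(stream, n=2)
    # Hypothetical recency-locked responder: answers "match" whenever the
    # current cell equals the immediately preceding one (lag 1, not lag 2).
    responses = [i >= 1 and stream[i] == stream[i - 1] for i in range(len(stream))]
    hits = sum(r and t for r, t in zip(responses, targets))
    fas = sum(r and not t for r, t in zip(responses, targets))
    n_t = sum(targets)
    print("2-back d' of a recency-locked responder:",
          round(d_prime(hits, fas, n_t, len(targets) - n_t), 2))
    for g in (3, 4, 5):
        s = make_stream(grid_size=g, length=2000, seed=1)
        print(f"{g}x{g} grid, lag-1 repeat rate:", round(recent_repeat_rate(s), 3))
```

Under these assumptions, a denser grid yields more chance repeats at short lags, so a recency-locked responder accumulates more false alarms on 2-back lures, which is one way the grid-size effect on interference described in the abstract could manifest.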
Related papers
- Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions [62.02112656288921]
Reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. We learn a compact latent action space for RL fine-tuning instead. We leverage both paired image-text data and text-only data to construct the latent action space.
arXiv Detail & Related papers (2026-01-12T13:13:24Z) - JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation [22.956416709470503]
Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. We propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations.
arXiv Detail & Related papers (2025-09-26T16:29:37Z) - Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs [49.42020616826156]
Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs, yet demonstrate higher accuracies when performing an analogous task on text. We investigate this accuracy gap by identifying and comparing the circuits in different modalities. To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers.
arXiv Detail & Related papers (2025-06-10T17:59:21Z) - Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks. They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison. We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z) - A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection.
There is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task.
We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap.
arXiv Detail & Related papers (2024-10-01T08:16:40Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memorized with the weights of FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval [9.899703354116962]
Dense video captioning aims to automatically localize and caption all events within untrimmed video.
We propose a novel framework inspired by the cognitive information processing of humans.
Our model utilizes external memory to incorporate prior knowledge.
arXiv Detail & Related papers (2024-04-11T09:58:23Z) - Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z) - On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Memory-Based Semantic Parsing [79.48882899104997]
We present a memory-based model for context-dependent semantic parsing.
We learn a context memory controller that manages the memory by maintaining the cumulative meaning of sequential user utterances.
arXiv Detail & Related papers (2021-09-07T16:15:13Z)