Related papers: State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models

URL: http://arxiv.org/abs/2602.15858v1
Date: Sun, 25 Jan 2026 17:41:26 GMT
Title: State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models
Authors: Annie Wong, Aske Plaat, Thomas Bäck, Niki van Stein, Anna V. Kononova
Abstract summary: We show that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons.
Score: 2.869209069091683
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. First, we find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.
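Illustration: the structure and spatial-grounding axes described in the abstract can be pictured as different renderings of one and the same state. The sketch below is a minimal Python example under stated assumptions: the `GridState` class, its fields, and the three rendering functions are hypothetical and are not taken from the paper or its benchmarks. It only shows what the formats might look like. The abstract attributes the benefit of textual maps to the model constructing the map itself, which a pre-rendered map like this one does not capture.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GridState:
    """Hypothetical toy state; not one of the paper's benchmark environments."""
    width: int
    height: int
    agent: tuple[int, int]  # (x, y), origin at the top-left cell
    goal: tuple[int, int]
    walls: set[tuple[int, int]] = field(default_factory=set)


def as_natural_language(s: GridState) -> str:
    """Render the state as a plain-English description."""
    wall_text = ", ".join(f"({x}, {y})" for x, y in sorted(s.walls)) or "none"
    return (
        f"You are at ({s.agent[0]}, {s.agent[1]}) on a {s.width}x{s.height} grid. "
        f"The goal is at ({s.goal[0]}, {s.goal[1]}). Walls: {wall_text}."
    )


def as_json(s: GridState) -> str:
    """Render the same information as a symbolic (JSON) encoding."""
    return json.dumps(
        {
            "size": [s.width, s.height],
            "agent": list(s.agent),
            "goal": list(s.goal),
            "walls": [list(w) for w in sorted(s.walls)],
        },
        sort_keys=True,
    )


def as_text_map(s: GridState) -> str:
    """Render the state as a character grid: A agent, G goal, # wall, . free."""
    rows = []
    for y in range(s.height):
        row = ""
        for x in range(s.width):
            if (x, y) == s.agent:
                row += "A"
            elif (x, y) == s.goal:
                row += "G"
            elif (x, y) in s.walls:
                row += "#"
            else:
                row += "."
        rows.append(row)
    return "\n".join(rows)


state = GridState(width=3, height=2, agent=(0, 0), goal=(2, 1), walls={(1, 0)})
for render in (as_natural_language, as_json, as_text_map):
    print(render(state), end="\n\n")
```

All three renderings carry the same information, which mirrors the abstract's point that performance differences arise from how the state is represented rather than from what information is available.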

Related papers

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models [33.99822400076112]
Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization. We present a systematic dissection of these models to identify the core components driving their performance. We establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong.
arXiv Detail & Related papers (2025-10-20T06:17:57Z)
VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning [10.497961559068493]
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes. Existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage. VisualTrans is the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios.
arXiv Detail & Related papers (2025-08-06T03:07:05Z)
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models [57.92316645992816]
Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs). We demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
arXiv Detail & Related papers (2024-12-03T06:15:04Z)
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships. We develop the REVISION framework which improves spatial fidelity in vision-language models. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings. The approach uses a masked language modelling objective to learn the representations. We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text. We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality. We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
arXiv Detail & Related papers (2022-10-16T04:35:58Z)
Coalescing Global and Local Information for Procedural Text Understanding [70.10291759879887]
A complete procedural understanding solution should combine three core aspects: local and global views of the inputs, and global view of outputs. In this paper, we propose Coalescing Global and Local Information (CGLI), a new model that builds entity and time representations. Experiments on a popular procedural text understanding dataset show that our model achieves state-of-the-art results.
arXiv Detail & Related papers (2022-08-26T19:16:32Z)
Generating Coherent Narratives by Learning Dynamic and Discrete Entity States with a Contrastive Framework [68.1678127433077]
We extend the Transformer model to dynamically conduct entity state updates and sentence realization for narrative generation. Experiments on two narrative datasets show that our model can generate more coherent and diverse narratives than strong baselines.
arXiv Detail & Related papers (2022-08-08T09:02:19Z)
Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP). By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently. Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
Robust and Interpretable Grounding of Spatial References with Relation Networks [40.42540299023808]
Learning representations of spatial references in natural language is a key challenge in tasks like autonomous navigation and robotic manipulation. Recent work has investigated various neural architectures for learning multi-modal representations for spatial concepts. We develop effective models for understanding spatial references in text that are robust and interpretable.
arXiv Detail & Related papers (2020-05-02T04:11:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.