Emergence of Abstract State Representations in Embodied Sequence
Modeling
- URL: http://arxiv.org/abs/2311.02171v2
- Date: Tue, 7 Nov 2023 05:03:08 GMT
- Title: Emergence of Abstract State Representations in Embodied Sequence
Modeling
- Authors: Tian Yun, Zilai Zeng, Kunal Handa, Ashish V. Thapliyal, Bo Pang, Ellie
Pavlick, Chen Sun
- Abstract summary: Decision making via sequence modeling aims to mimic the success of language models, with actions taken by an embodied agent modeled as tokens to predict.
We show that environmental layouts can be reasonably reconstructed from the internal activations of a trained model.
Our results support an optimistic outlook for applications of sequence modeling objectives to more complex embodied decision-making domains.
- Score: 24.827284626429964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decision making via sequence modeling aims to mimic the success of language
models, where actions taken by an embodied agent are modeled as tokens to
predict. Despite their promising performance, it remains unclear if embodied
sequence modeling leads to the emergence of internal representations that
represent the environmental state information. A model that lacks abstract
state representations would be liable to make decisions based on surface
statistics which fail to generalize. We take the BabyAI environment, a grid
world in which language-conditioned navigation tasks are performed, and build a
sequence modeling Transformer, which takes a language instruction, a sequence
of actions, and environmental observations as its inputs. In order to
investigate the emergence of abstract state representations, we design a
"blindfolded" navigation task, where only the initial environmental layout, the
language instruction, and the action sequence to complete the task are
available for training. Our probing results show that intermediate
environmental layouts can be reasonably reconstructed from the internal
activations of a trained model, and that language instructions play a role in
the reconstruction accuracy. Our results suggest that many key features of
state representations can emerge via embodied sequence modeling, supporting an
optimistic outlook for applications of sequence modeling objectives to more
complex embodied decision-making domains.
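To make the setup concrete, here is a minimal PyTorch sketch of the kind of pipeline the abstract describes: a causal Transformer over tokenized (instruction, initial layout, action) episodes, with a linear probe trained to read intermediate grid layouts out of the hidden activations. All names, dimensions, vocabularies, and the per-cell classification target below are illustrative assumptions, not the authors' actual architecture or code.
```python
# Hypothetical sketch of "blindfolded" embodied sequence modeling plus a
# layout probe. Sizes and vocabularies are assumptions for illustration
# only, not the configuration used in the paper.
import torch
import torch.nn as nn

VOCAB = 128          # shared vocab: instruction words, layout tokens, actions
D_MODEL = 64
MAX_LEN = 512
GRID = 8             # assumed grid side length
N_CELL_TYPES = 6     # assumed cell categories (empty, wall, door, goal, ...)


class BlindfoldedSequenceModel(nn.Module):
    """Causal Transformer over [instruction; initial layout; actions]."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)  # next-token (action) prediction

    def forward(self, tokens):
        # tokens: (batch, seq_len); the environment is only observed at t=0,
        # so later steps must track state internally ("blindfolded").
        seq_len = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device), 1
        )
        hidden = self.encoder(x, mask=causal)  # activations the probe reads
        return self.head(hidden), hidden


# Probe: predict the full intermediate layout (one class per cell) from the
# hidden state at a given action step. The probe is trained separately, with
# the base model frozen, so it only measures information already present in
# the activations.
probe = nn.Linear(D_MODEL, GRID * GRID * N_CELL_TYPES)

model = BlindfoldedSequenceModel()
tokens = torch.randint(0, VOCAB, (2, 32))   # two dummy episodes
logits, hidden = model(tokens)
cell_logits = probe(hidden[:, -1]).view(2, GRID, GRID, N_CELL_TYPES)
print(cell_logits.shape)                    # torch.Size([2, 8, 8, 6])
```
Comparing probing accuracy on held-out episodes against a probe trained on a randomly initialized model is the standard way to argue that such state information emerged from training rather than from the probe itself.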
Related papers
- LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states with respect to history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
- Learning with Language-Guided State Abstractions [58.199148890064826]
Generalizable policy learning in high-dimensional observation spaces is facilitated by well-designed state representations.
Our method, LGA, uses a combination of natural language supervision and background knowledge from language models to automatically build state representations tailored to unseen tasks.
Experiments on simulated robotic tasks show that LGA yields state abstractions similar to those designed by humans, but in a fraction of the time.
arXiv Detail & Related papers (2024-02-28T23:57:04Z)
- Emergence and Function of Abstract Representations in Self-Supervised Transformers [0.0]
We study the inner workings of small-scale transformers trained to reconstruct partially masked visual scenes.
We show that the network develops intermediate abstract representations, or abstractions, that encode all semantic features of the dataset.
Using precise manipulation experiments, we demonstrate that abstractions are central to the network's decision-making process.
arXiv Detail & Related papers (2023-12-08T20:47:15Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- Online Learning of Reusable Abstract Models for Object Goal Navigation [18.15382773079023]
We present a novel approach to incrementally learn an Abstract Model of an unknown environment.
We show how an agent can reuse the learned model for tackling the Object Goal Navigation task.
arXiv Detail & Related papers (2022-03-04T21:44:43Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Skill Induction and Planning with Latent Language [94.55783888325165]
We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions.
We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks.
In trained models, the space of natural language commands indexes a library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals.
arXiv Detail & Related papers (2021-10-04T15:36:32Z)
- S2RMs: Spatially Structured Recurrent Modules [105.0377129434636]
We take a step towards models that are capable of simultaneously exploiting both modular and temporal structures.
We find our models to be robust to the number of available views and better able to generalize to novel tasks without additional training.
arXiv Detail & Related papers (2020-07-13T17:44:30Z)
- Learning intuitive physics and one-shot imitation using state-action-prediction self-organizing maps [0.0]
Humans learn by exploration and imitation, build causal models of the world, and use both to flexibly solve new tasks.
We suggest a simple but effective unsupervised model which develops such characteristics.
We demonstrate its performance on a set of several related, but different one-shot imitation tasks, which the agent flexibly solves in an active inference style.
arXiv Detail & Related papers (2020-07-03T12:29:11Z)