From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
- URL: http://arxiv.org/abs/2501.00296v3
- Date: Tue, 10 Jun 2025 03:08:29 GMT
- Title: From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
- Authors: Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
- Abstract summary: We focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively.
- Score: 32.81048722407204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
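For concreteness, below is a minimal, illustrative sketch of the test-time loop described in the abstract, not the authors' implementation: a VLM evaluates a set of candidate predicates on a camera image to produce a symbolic state, and a search-based planner sequences low-level skills using symbolic operators. Every name here (Operator, abstract_state, plan, fake_vlm, and the toy predicates and skills) is a hypothetical placeholder, and the training-time predicate proposal and optimization-based operator learning are only noted in comments.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Operator:
    """Symbolic model of one low-level skill (learned from demonstrations in the paper)."""
    skill: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]


def abstract_state(image, predicates: List[str],
                   query_vlm: Callable[[object, str], bool]) -> FrozenSet[str]:
    """Build a symbolic state description by asking the VLM to evaluate each predicate on an image."""
    return frozenset(p for p in predicates if query_vlm(image, p))


def plan(init: FrozenSet[str], goal: FrozenSet[str], operators: List[Operator],
         max_depth: int = 10) -> List[str]:
    """Breadth-first search over abstract states; returns a sequence of skill names."""
    frontier: List[Tuple[FrozenSet[str], List[str]]] = [(init, [])]
    visited = {init}
    while frontier:
        state, skills = frontier.pop(0)
        if goal <= state:
            return skills
        if len(skills) >= max_depth:
            continue
        for op in operators:
            if op.preconditions <= state:
                nxt = (state - op.delete_effects) | op.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, skills + [op.skill]))
    return []  # no plan found within max_depth


if __name__ == "__main__":
    # Toy stand-in for the VLM: in the paper, the candidate predicates are proposed and
    # evaluated by a pretrained VLM, and the operators below come from the
    # optimization-based model-learning step rather than being hand-written.
    candidates = ["HandEmpty()", "OnTable(apple)", "Holding(apple)", "In(apple, bowl)"]

    def fake_vlm(image, predicate: str) -> bool:
        return predicate in {"HandEmpty()", "OnTable(apple)"}

    pick = Operator("Pick(apple)",
                    preconditions=frozenset({"HandEmpty()", "OnTable(apple)"}),
                    add_effects=frozenset({"Holding(apple)"}),
                    delete_effects=frozenset({"HandEmpty()", "OnTable(apple)"}))
    place = Operator("PlaceIn(apple, bowl)",
                     preconditions=frozenset({"Holding(apple)"}),
                     add_effects=frozenset({"In(apple, bowl)", "HandEmpty()"}),
                     delete_effects=frozenset({"Holding(apple)"}))

    init = abstract_state(image=None, predicates=candidates, query_vlm=fake_vlm)
    goal = frozenset({"In(apple, bowl)"})
    print(plan(init, goal, [pick, place]))  # ['Pick(apple)', 'PlaceIn(apple, bowl)']
```

A real system would replace fake_vlm with structured queries to a pretrained VLM and would obtain both the compact predicate subset and the operators from the demonstrations, as the abstract describes.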
Related papers
- Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives [36.297745473653166]
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language.
Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress.
arXiv Detail & Related papers (2025-05-20T13:47:40Z) - VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning [86.59849798539312]
We present Neuro-Symbolic Predicates, a first-order abstraction language that combines the strengths of symbolic and neural knowledge representations (an illustrative sketch of such a predicate appears after this list).
We show that our approach offers better sample complexity, stronger out-of-distribution generalization, and improved interpretability.
arXiv Detail & Related papers (2024-10-30T16:11:05Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object Segmentation (RVOS) methods typically use vision and language models pretrained independently as backbones.
We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction.
Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z) - Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z) - MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z) - Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z) - One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z) - LanGWM: Language Grounded World Model [24.86620763902546]
We focus on learning language-grounded visual features to enhance the world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
arXiv Detail & Related papers (2023-11-29T12:41:55Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - LIV: Language-Image Representations and Rewards for Robotic Control [37.12560985663822]
We present a unified objective for vision-language representation and reward learning from action-free videos with text annotations.
We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen.
Our results validate the advantages of joint vision-language representation and reward learning within the unified, compact LIV framework.
arXiv Detail & Related papers (2023-06-01T17:52:23Z) - Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z) - Learning Universal Policies via Text-Guided Video Generation [179.6347119101618]
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks.
Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images.
We investigate whether such tools can be used to construct more general-purpose agents.
arXiv Detail & Related papers (2023-01-31T21:28:13Z) - Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z) - Language Model-Based Paired Variational Autoencoders for Robotic Language Learning [18.851256771007748]
Similar to human infants, artificial agents can learn language while interacting with their environment.
We present a neural model that bidirectionally binds robot actions and their language descriptions in a simple object manipulation scenario.
Next, we introduce PVAE-BERT, which equips the model with a pretrained large-scale language model.
arXiv Detail & Related papers (2022-01-17T10:05:26Z) - Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.