Related papers: LanGWM: Language Grounded World Model

LanGWM: Language Grounded World Model

URL: http://arxiv.org/abs/2311.17593v1
Date: Wed, 29 Nov 2023 12:41:55 GMT
Title: LanGWM: Language Grounded World Model
Authors: Rudra P.K. Poudel, Harit Pandya, Chao Zhang, Roberto Cipolla
Abstract summary: We focus on learning language-grounded visual features to enhance the world model learning. Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
Score: 24.86620763902546
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in deep reinforcement learning have showcased its potential in tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. Conversely, expressing higher-level concepts and global contexts is relatively easy using language. Building upon recent success of the large language models, our main objective is to improve the state abstraction technique in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance the world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide the text prompt as descriptions for these masked objects. Subsequently, we predict the masked objects along with the surrounding regions as pixel reconstruction, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM: Language Grounded World Model achieves state-of-the-art performance in out-of-distribution test at the 100K interaction steps benchmarks of iGibson point navigation tasks. Furthermore, our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction because our extracted visual features are language grounded.

Related papers

Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments [0.0]
This work proposes a vision-language integration framework that unifies pre-trained visual encoders and large language models.<n>The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics.
arXiv Detail & Related papers (2025-10-29T01:16:21Z)
Learning Compositional Behaviors from Demonstration and Language [28.352574199884852]
BLADE is a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning.<n>We show significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals.
arXiv Detail & Related papers (2025-05-28T05:19:59Z)
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface [25.898592418636603]
ours is a framework that textbfUnifies textbfFine-grained visual perception tasks through an textbfOpen-ended language interface. ours unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies.
arXiv Detail & Related papers (2025-03-03T09:27:24Z)
ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers [1.6541870997607049]
We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers. ARPA's introduction marks a significant milestone in visual word disambiguation, offering a compelling solution. We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive unprecedented advancements in artificial intelligence.
arXiv Detail & Related papers (2024-08-12T10:15:13Z)
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling [47.7950860342515]
LexiContrastive Grounding (LCG) is a grounded language learning procedure that leverages visual supervision to improve textual representations. LCG outperforms standard language-only models in learning efficiency. It improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization.
arXiv Detail & Related papers (2024-03-21T16:52:01Z)
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose bfAnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references. Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs) Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks. Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. We introduce a framework for language-driven representation learning from human videos and captions. We find that Voltron's language-driven learning outperform the prior-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z)
ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z)
Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection [3.785123406103386]
We take advantage of language prompt to introduce effective and unbiased linguistic supervision into object detection. We propose a new mechanism called multimodal knowledge learning (textbfMKL), which is required to learn knowledge from language supervision.
arXiv Detail & Related papers (2022-05-09T07:03:30Z)
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images. "vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.