OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
- URL: http://arxiv.org/abs/2407.00114v2
- Date: Thu, 31 Oct 2024 14:27:50 GMT
- Title: OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
- Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang,
- Abstract summary: This paper presents OmniJARVIS, a novel Vision-Language-Action model for open-world instruction-following agents in Minecraft.
Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act.
Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.
- Score: 26.416725410359064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation learning policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the imitation learning policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials. The dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS.
Related papers
- OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions.
Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies.
We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy [50.13429055093534]
Optimus-2 is a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning.
We introduce a high-quality Minecraft Goal-Observation-Action (MGOA) dataset, which contains 25,000 videos across 8 atomic tasks.
Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft.
arXiv Detail & Related papers (2025-02-27T09:18:04Z) - Doe-1: Closed-Loop Autonomous Driving with Large World Model [63.99937807085461]
We propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning.
We use free-form texts for perception and generate future predictions directly in the RGB space with image tokens.
For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens.
arXiv Detail & Related papers (2024-12-12T18:59:59Z) - ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning.
We propose visual-temporal context, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z) - Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Vision Language Models (VLMs) can process state information as visual-textual prompts and respond with policy decisions in text.
We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models [69.49978333446538]
TEAL is an approach to treat the input from any modality as a token sequence.
It embeds the token sequence into a joint embedding space with a learnable embedding matrix.
Experiments show that TEAL achieves substantial improvements in multi-modal understanding.
arXiv Detail & Related papers (2023-11-08T10:34:16Z) - Reinforced UI Instruction Grounding: Towards a Generic UI Task
Automation API [17.991044940694778]
We build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor.
To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm.
Our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-10-07T07:22:41Z) - TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - VoxPoser: Composable 3D Value Maps for Robotic Manipulation with
Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z) - ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple
Oracles [29.460712493470453]
We propose a new paradigm of zero-shot controllable text generation with multimodal signals (textscZeroGen)
textscZeroGen leverages controls of text and image successively from token-level to sentence-level and maps them into a unified probability space at decoding.
We show that textscZeroGen not only outperforms its counterparts on captioning tasks by a large margin but also shows great potential in multimodal news generation with a higher degree of control.
arXiv Detail & Related papers (2023-06-29T03:22:43Z) - VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive
Learning [82.09856883441044]
Video understanding relies on perceiving the global content modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.