MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in
3D World
- URL: http://arxiv.org/abs/2401.08577v1
- Date: Tue, 16 Jan 2024 18:59:45 GMT
- Title: MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in
3D World
- Authors: Yining Hong, Zishuo Zheng, Peihao Chen, Yian Wang, Junyan Li, Chuang
Gan
- Abstract summary: We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data points.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
- Score: 55.878173953175356
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Human beings possess the capability to multiply a melange of multisensory
cues while actively exploring and interacting with the 3D world. Current
multi-modal large language models, however, passively absorb sensory data as
inputs, lacking the capacity to actively interact with the objects in the 3D
environment and dynamically collect their multisensory information. To usher in
the study of this area, we propose MultiPLY, a multisensory embodied large
language model that can incorporate multisensory interactive data, including
visual, audio, tactile, and thermal information, into large language models,
thereby establishing the correlation among words, actions, and percepts. To
this end, we first collect Multisensory Universe, a large-scale multisensory
interaction dataset comprising 500k data points by deploying an LLM-powered embodied
agent to engage with the 3D environment. To perform instruction tuning with
a pre-trained LLM on such generated data, we first encode the 3D scene as
abstracted object-centric representations and then introduce action tokens
denoting that the embodied agent takes certain actions within the environment,
as well as state tokens that represent the multisensory state observations of
the agent at each time step. At inference time, MultiPLY can generate
action tokens, instructing the agent to take the action in the environment and
obtain the next multisensory state observation. The observation is then
appended back to the LLM via state tokens to generate subsequent text or action
tokens. We demonstrate that MultiPLY outperforms baselines by a large margin
through a diverse set of embodied tasks involving object retrieval, tool use,
multisensory captioning, and task decomposition.
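The abstract describes an interleaved generate-act-observe loop built around action tokens and state tokens. Below is a minimal Python sketch of how such a loop could be wired together; the token vocabulary, the MultisensoryState fields, and the llm_step / env_step callables are illustrative assumptions rather than the paper's actual interface.

```python
# A minimal sketch (not the authors' code) of the interleaved generation loop
# described in the abstract: the LLM emits tokens until it produces an action
# token, the embodied agent executes that action in the 3D environment, and
# the resulting multisensory observation is appended back as state tokens.
# All names below are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Callable, List

ACTION_TOKENS = {"<navigate>", "<touch>", "<knock>", "<pick_up>"}  # assumed vocabulary
END_TOKEN = "<end>"


@dataclass
class MultisensoryState:
    """Hypothetical container for one observation returned by the environment."""
    visual: list = field(default_factory=list)   # e.g. object-centric features
    tactile: list = field(default_factory=list)
    audio: list = field(default_factory=list)
    thermal: list = field(default_factory=list)


def encode_state(state: MultisensoryState) -> List[str]:
    """Turn an observation into state tokens the LLM can condition on (stub)."""
    return ["<state>", f"<obs:{len(state.visual)}-objects>", "</state>"]


def run_episode(
    llm_step: Callable[[List[str]], str],          # returns the next token
    env_step: Callable[[str], MultisensoryState],  # executes an action token
    prompt_tokens: List[str],
    max_steps: int = 64,
) -> List[str]:
    """Interleave LLM generation with environment interaction."""
    context = list(prompt_tokens)
    for _ in range(max_steps):
        token = llm_step(context)
        context.append(token)
        if token == END_TOKEN:
            break
        if token in ACTION_TOKENS:
            # The agent acts in the 3D environment; the new multisensory
            # observation is fed back to the model via state tokens.
            observation = env_step(token)
            context.extend(encode_state(observation))
    return context
```

In the paper's setting, llm_step would wrap the instruction-tuned LLM and env_step the interactive 3D environment; both are left as plain callables here so the control flow stays visible.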
Related papers
- Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI [10.335943413484815]
Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment.
We introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation.
We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time.
arXiv Detail & Related papers (2024-10-06T23:25:21Z)
- Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, who travelled a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545.
arXiv Detail & Related papers (2024-06-14T10:23:53Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs [15.402143137362112]
Future interactive interfaces should intelligently provide quick access to digital actions based on users' context.
We generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs.
We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information.
arXiv Detail & Related papers (2024-05-06T23:11:00Z)
- Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc.
arXiv Detail & Related papers (2023-07-26T08:40:46Z)
- ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation [75.0278287071591]
ThreeDWorld (TDW) is a platform for interactive multi-modal physical simulation.
TDW enables simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments.
We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science.
arXiv Detail & Related papers (2020-07-09T17:33:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.