ArK: Augmented Reality with Knowledge Interactive Emergent Ability
- URL: http://arxiv.org/abs/2305.00970v1
- Date: Mon, 1 May 2023 17:57:01 GMT
- Title: ArK: Augmented Reality with Knowledge Interactive Emergent Ability
- Authors: Qiuyuan Huang, Jae Sung Park, Abhinav Gupta, Paul Bennett, Ran Gong,
Subhojit Som, Baolin Peng, Owais Khan Mohammed, Chris Pal, Yejin Choi,
Jianfeng Gao
- Abstract summary: We develop an infinite agent that learns to transfer knowledge memory from general foundation models to novel domains.
The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK).
We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes.
- Score: 115.72679420999535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the growing adoption of mixed reality and interactive AI agents, it
remains challenging for these systems to generate high-quality 2D/3D scenes in
unseen environments. The common practice requires deploying an AI agent to
collect large amounts of data for model training for every new task, a process
that is costly, or even impossible, for many domains. In this study, we
develop an infinite agent that learns to transfer knowledge memory from general
foundation models (e.g., GPT-4, DALL-E) to novel domains or scenarios for scene
understanding and generation in the physical or virtual world. The heart of our
approach is an emerging mechanism, dubbed Augmented Reality with Knowledge
Inference Interaction (ArK), which leverages knowledge memory to generate
scenes in unseen physical-world and virtual-reality environments. The knowledge
interactive emergent ability (Figure 1) is demonstrated in two ways: i)
micro-actions of cross-modality, in which multi-modality models collect a large
amount of relevant knowledge-memory data for each interaction task (e.g.,
unseen scene understanding) from physical reality; and ii) reality-agnostic
macro-behavior, in which interactions in mixed-reality environments are
tailored to different characterized roles, target variables, collaborative
information, and so on. We validate the effectiveness of ArK on scene
generation and editing tasks. We show that our ArK approach, combined with
large foundation models, significantly improves the quality of generated 2D/3D
scenes compared to baselines, demonstrating the potential benefit of
incorporating ArK into generative AI for applications such as the metaverse and
gaming simulation.
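
Since the abstract describes the mechanism only in prose, a minimal sketch of the knowledge-memory loop it implies is given below. The `KnowledgeMemory` class, the word-overlap retrieval heuristic, and the `llm` callable are hypothetical illustrations, not the paper's actual code.

```python
# Minimal sketch of an ArK-style knowledge-memory loop (hypothetical
# interfaces; the paper's actual architecture may differ).
from dataclasses import dataclass, field


@dataclass
class KnowledgeMemory:
    """Stores (observation, knowledge) pairs gathered from interactions."""
    entries: list = field(default_factory=list)

    def store(self, observation: str, knowledge: str) -> None:
        self.entries.append((observation, knowledge))

    def retrieve(self, observation: str, k: int = 3) -> list:
        # Naive relevance: word overlap with the stored observation.
        def overlap(entry):
            return len(set(observation.split()) & set(entry[0].split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]


def generate_scene(observation: str, memory: KnowledgeMemory, llm) -> str:
    """Query a foundation model, grounded in retrieved knowledge memory."""
    context = "\n".join(kn for _, kn in memory.retrieve(observation))
    prompt = f"Known facts:\n{context}\n\nGenerate a scene for: {observation}"
    scene = llm(prompt)               # e.g. a GPT-4-style text-completion API
    memory.store(observation, scene)  # feed the result back into memory
    return scene
```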
Related papers
- Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI [10.335943413484815]
Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment.
We introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation.
We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time.
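
A rough illustration of such a unified record and language-driven lookup follows; the schema, names, and similarity measure are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a multimodal 3D object record with language-based spatial
# search (hypothetical schema; not the paper's actual system).
from dataclasses import dataclass
import numpy as np


@dataclass
class Object3D:
    name: str             # linguistic label, e.g. "coffee mug"
    position: np.ndarray  # 3D centroid in world coordinates
    embedding: np.ndarray # text/vision-aligned semantic embedding


def spatial_search(query_emb: np.ndarray, objects: list, top_k: int = 1):
    """Return the objects whose embeddings best match a language query."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return sorted(objects, key=lambda o: cosine(query_emb, o.embedding),
                  reverse=True)[:top_k]
```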
arXiv Detail & Related papers (2024-10-06T23:25:21Z)
- Online Decision MetaMorphFormer: A Causal Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence [2.890656584329591]
Online Decision MetaMorphFormer (ODM) aims to achieve self-awareness, environment recognition, and action planning.
ODM can be applied to arbitrary agents with multi-joint bodies, located in different environments, and trained on different types of tasks using large-scale pre-trained datasets.
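
As a hedged sketch of what a causal-transformer policy of this kind can look like (the dimensions, names, and single-step action head are assumptions, not ODM's actual architecture):

```python
# Sketch of a causal transformer mapping an observation history to the
# next action (illustrative only; ODM's real architecture differs).
import torch
import torch.nn as nn


class CausalPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # Causal mask so each step attends only to past observations.
        t = obs_seq.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.encoder(self.embed(obs_seq), mask=mask)
        return self.head(h[:, -1])  # action for the latest timestep
```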
arXiv Detail & Related papers (2024-09-11T15:22:43Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
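
In spirit, such a unified paradigm can be pictured as a weighted sum of the three pre-training losses named above; a schematic version (the loss weights and tensor names are assumptions) is:

```python
# Schematic multi-task objective combining the three pre-training
# signals the summary names (weights are illustrative assumptions).
import torch

def agent_pretraining_loss(mae_recon, mae_target,   # masked image patches
                           lm_logits, lm_target,    # next-token prediction
                           act_logits, act_target,  # next-action prediction
                           w_mae=1.0, w_lm=1.0, w_act=1.0):
    l_mae = torch.nn.functional.mse_loss(mae_recon, mae_target)
    l_lm = torch.nn.functional.cross_entropy(lm_logits, lm_target)
    l_act = torch.nn.functional.cross_entropy(act_logits, act_target)
    return w_mae * l_mae + w_lm * l_lm + w_act * l_act
```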
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
- REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
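
A minimal sketch of a transformer encoder-decoder classifier in this spirit (all dimensions and names are illustrative assumptions, not REACT's implementation):

```python
# Sketch: a learned query decodes a group-activity label from per-actor
# features via a transformer encoder-decoder (illustrative only).
import torch
import torch.nn as nn


class GroupActivityModel(nn.Module):
    def __init__(self, feat_dim: int, n_classes: int, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # group query
        self.classify = nn.Linear(d_model, n_classes)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        # person_feats: (batch, n_people, feat_dim) per-actor features
        memory = self.proj(person_feats)
        q = self.query.expand(person_feats.size(0), -1, -1)
        out = self.transformer(memory, q)  # decoder attends over all actors
        return self.classify(out[:, 0])    # group-activity logits
```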
arXiv Detail & Related papers (2023-11-27T20:48:54Z)
- Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models [21.13906762261418]
A long-standing challenge for a robotic manipulation system is adapting and generalizing its acquired motor skills to unseen environments.
We tackle this challenge by employing hybrid skill models that integrate imitation and reinforcement paradigms.
We show that our method enables a robot to gain a significant zero-shot generalization to novel environments and to refine skills in the target environments faster than learning from scratch.
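
A rough sketch of the skill representation only (the SAC-style reinforcement refinement is omitted, and the data and names are assumptions):

```python
# Sketch: fit a GMM over demonstration waypoints expressed in a
# keypoint-centric frame, then anchor the skill at a new keypoint.
# (Illustrative only; the paper couples this with SAC-style refinement.)
import numpy as np
from sklearn.mixture import GaussianMixture

# Demo waypoints (N, 3), recorded relative to the training-scene keypoint.
demos = np.random.randn(200, 3) * 0.05 + np.array([0.1, 0.0, 0.2])

gmm = GaussianMixture(n_components=3).fit(demos)

def rollout_skill(new_keypoint: np.ndarray, n_steps: int = 10) -> np.ndarray:
    """Sample relative waypoints and anchor them at the unseen keypoint."""
    rel, _ = gmm.sample(n_steps)
    return new_keypoint + rel  # zero-shot transfer to the novel scene

waypoints = rollout_skill(np.array([0.5, 0.3, 0.1]))
```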
arXiv Detail & Related papers (2023-10-23T16:03:23Z)
- Knowledge-enhanced Agents for Interactive Text Games [16.055119735473017]
We propose a knowledge-injection framework for improved functional grounding of agents in text-based games.
We consider two forms of domain knowledge that we inject into learning-based agents: memory of previous correct actions and affordances of relevant objects in the environment.
Our framework supports two representative model classes: reinforcement learning agents and language model agents.
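
A minimal sketch of these two injections for a language-model agent (the prompt format and names are assumptions, not the paper's framework):

```python
# Sketch of the two knowledge injections the summary names: memory of
# previously correct actions and object affordances (hypothetical API).
def build_prompt(observation: str,
                 correct_history: list,
                 affordances: dict) -> str:
    """Augment the raw game observation with injected domain knowledge."""
    memory = "\n".join(f"- {a}" for a in correct_history[-5:])
    afford = "\n".join(f"- {obj}: {', '.join(verbs)}"
                       for obj, verbs in affordances.items())
    return (f"Observation:\n{observation}\n\n"
            f"Previously correct actions:\n{memory}\n\n"
            f"Object affordances:\n{afford}\n\nNext action:")

prompt = build_prompt("You are in a kitchen. There is a stove and a pot.",
                      ["take pot", "go to kitchen"],
                      {"stove": ["turn on", "turn off"],
                       "pot": ["fill", "take"]})
```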
arXiv Detail & Related papers (2023-05-08T23:31:39Z)
- WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained on large-scale multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
- Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments [66.83839051693695]
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment.
We propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance.
A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives.
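
The parametric scene description can be pictured as a configuration record whose fields control the complexity of the generated stream; a schematic version (field names are assumptions) follows:

```python
# Schematic parametric scene description controlling the visual
# complexity of a generated 3D stream (field names are assumptions).
from dataclasses import dataclass

@dataclass
class SceneParams:
    n_objects: int = 10           # how many objects populate the scene
    texture_realism: float = 0.5  # 0 = flat shading, 1 = photo-realistic
    lighting_variation: float = 0.2
    object_motion_speed: float = 0.0  # >0 yields a dynamic, life-long stream

easy = SceneParams(n_objects=3, texture_realism=0.2)
hard = SceneParams(n_objects=50, texture_realism=1.0, object_motion_speed=1.5)
```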
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
- ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation [75.0278287071591]
ThreeDWorld (TDW) is a platform for interactive multi-modal physical simulation.
TDW enables simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments.
We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science.
arXiv Detail & Related papers (2020-07-09T17:33:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.