Related papers: ArK: Augmented Reality with Knowledge Interactive Emergent Ability

ArK: Augmented Reality with Knowledge Interactive Emergent Ability

URL: http://arxiv.org/abs/2305.00970v1
Date: Mon, 1 May 2023 17:57:01 GMT
Title: ArK: Augmented Reality with Knowledge Interactive Emergent Ability
Authors: Qiuyuan Huang, Jae Sung Park, Abhinav Gupta, Paul Bennett, Ran Gong, Subhojit Som, Baolin Peng, Owais Khan Mohammed, Chris Pal, Yejin Choi, Jianfeng Gao
Abstract summary: We develop an infinite agent that learns to transfer knowledge memory from general foundation models to novel domains. The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK) We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes.
Score: 115.72679420999535
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the growing adoption of mixed reality and interactive AI agents, it remains challenging for these systems to generate high quality 2D/3D scenes in unseen environments. The common practice requires deploying an AI agent to collect large amounts of data for model training for every new task. This process is costly, or even impossible, for many domains. In this study, we develop an infinite agent that learns to transfer knowledge memory from general foundation models (e.g. GPT4, DALLE) to novel domains or scenarios for scene understanding and generation in the physical or virtual world. The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK), which leverages knowledge-memory to generate scenes in unseen physical world and virtual reality environments. The knowledge interactive emergent ability (Figure 1) is demonstrated as the observation learns i) micro-action of cross-modality: in multi-modality models to collect a large amount of relevant knowledge memory data for each interaction task (e.g., unseen scene understanding) from the physical reality; and ii) macro-behavior of reality-agnostic: in mix-reality environments to improve interactions that tailor to different characterized roles, target variables, collaborative information, and so on. We validate the effectiveness of ArK on the scene generation and editing tasks. We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes, compared to baselines, demonstrating the potential benefit of incorporating ArK in generative AI for applications such as metaverse and gaming simulation.

Related papers

Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting [4.568580817155409]
Amodal completion is crucial for understanding human-object interactions in computer vision and robotics.<n>We develop a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI.<n>Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios.
arXiv Detail & Related papers (2025-08-01T08:33:45Z)
UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI [37.47562766916571]
We introduce UnrealZoo, a rich collection of photo-realistic 3D virtual worlds built on Unreal Engine. We offer a variety of playable entities for embodied AI agents.
arXiv Detail & Related papers (2024-12-30T14:31:01Z)
Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI [10.335943413484815]
seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. We introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time.
arXiv Detail & Related papers (2024-10-06T23:25:21Z)
Online Decision MetaMorphFormer: A Casual Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence [2.890656584329591]
Online Decision MetaMorphFormer (ODM) aims to achieve self-awareness, environment recognition, and action planning. ODM can be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets.
arXiv Detail & Related papers (2024-09-11T15:22:43Z)
An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data. We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Decoder (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding. We present REACT, an architecture inspired by the transformer encoder-decoder model. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
arXiv Detail & Related papers (2023-11-27T20:48:54Z)
Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models [21.13906762261418]
A long-standing challenge for a robotic manipulation system is adapting and generalizing its acquired motor skills to unseen environments. We tackle this challenge employing hybrid skill models that integrate imitation and reinforcement paradigms. We show that our method enables a robot to gain a significant zero-shot generalization to novel environments and to refine skills in the target environments faster than learning from scratch.
arXiv Detail & Related papers (2023-10-23T16:03:23Z)
Knowledge-enhanced Agents for Interactive Text Games [16.055119735473017]
We propose a knowledge-injection framework for improved functional grounding of agents in text-based games. We consider two forms of domain knowledge that we inject into learning-based agents: memory of previous correct actions and affordances of relevant objects in the environment. Our framework supports two representative model classes: reinforcement learning agents and language model agents.
arXiv Detail & Related papers (2023-05-08T23:31:39Z)
WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data. We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments [66.83839051693695]
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment. We propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance. A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives.
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation [75.0278287071591]
ThreeDWorld (TDW) is a platform for interactive multi-modal physical simulation. TDW enables simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments. We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science.
arXiv Detail & Related papers (2020-07-09T17:33:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.