Agent AI: Surveying the Horizons of Multimodal Interaction
- URL: http://arxiv.org/abs/2401.03568v2
- Date: Thu, 25 Jan 2024 21:20:27 GMT
- Title: Agent AI: Surveying the Horizons of Multimodal Interaction
- Authors: Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park,
Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi,
Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao
- Abstract summary: "Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
- Score: 83.18367129924997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal AI systems will likely become a ubiquitous presence in our
everyday lives. A promising approach to making these systems more interactive
is to embody them as agents within physical and virtual environments. At
present, systems leverage existing foundation models as the basic building
blocks for the creation of embodied agents. Embedding agents within such
environments facilitates the ability of models to process and interpret visual
and contextual data, which is critical for the creation of more sophisticated
and context-aware AI systems. For example, a system that can perceive user
actions, human behavior, environmental objects, audio expressions, and the
collective sentiment of a scene can be used to inform and direct agent
responses within the given environment. To accelerate research on agent-based
multimodal intelligence, we define "Agent AI" as a class of interactive systems
that can perceive visual stimuli, language inputs, and other
environmentally-grounded data, and can produce meaningful embodied actions. In
particular, we explore systems that aim to improve agents based on
next-embodied action prediction by incorporating external knowledge,
multi-sensory inputs, and human feedback. We argue that by developing agentic
AI systems in grounded environments, one can also mitigate the hallucinations
of large foundation models and their tendency to generate environmentally
incorrect outputs. The emerging field of Agent AI subsumes the broader embodied
and agentic aspects of multimodal interactions. Beyond agents acting and
interacting in the physical world, we envision a future where people can easily
create any virtual reality or simulated scene and interact with agents embodied
within the virtual environment.
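To make the abstract's perceive-then-act framing concrete, here is a minimal sketch of an Agent AI interaction loop over multimodal inputs. All class and method names are illustrative assumptions, not an API defined in the paper.

```python
from dataclasses import dataclass
from typing import Protocol

# Illustrative types only; the paper defines Agent AI conceptually,
# not as a concrete API.

@dataclass
class Observation:
    frame: bytes          # visual stimuli (e.g., an encoded RGB frame)
    utterance: str        # language input from the user
    scene_context: dict   # other environmentally-grounded data

@dataclass
class Action:
    name: str             # an embodied action, e.g., "pick_up"
    arguments: dict       # grounded parameters, e.g., {"object": "cup"}

class AgentAI(Protocol):
    def perceive(self, obs: Observation) -> None: ...
    def act(self) -> Action: ...

def interaction_loop(agent: AgentAI, env) -> None:
    """Run the perceive -> predict-next-embodied-action cycle."""
    obs = env.reset()
    done = False
    while not done:
        agent.perceive(obs)           # fuse vision, language, and context
        action = agent.act()          # next-embodied-action prediction
        obs, done = env.step(action)  # the environment grounds the action
```

Grounding the loop in an environment is what lets external knowledge, multi-sensory inputs, and human feedback flow back into the action predictor, which is the mitigation mechanism the abstract argues for.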
Related papers
- EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment [38.14321677323052]
Embodied artificial intelligence emphasizes the role of an agent's body in generating human-like behaviors.
In this paper, we construct a benchmark platform for embodied intelligence evaluation in real-world city environments.
arXiv Detail & Related papers (2024-10-12T17:49:26Z)
- HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions [76.42274173122328]
We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions.
We run 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education).
Our experiments show that state-of-the-art LLMs, both proprietary and open-source, exhibit safety risks in over 50% of cases.
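As an illustration of this kind of evaluation, a hypothetical harness might sweep scenarios, run many sandboxed episodes each, and report the fraction flagged as unsafe. Every name and number below is a placeholder, not HAICOSYSTEM's actual benchmark.

```python
import random

DOMAINS = ["healthcare", "finance", "education"]  # 3 of the 7 domains

def simulate(scenario: str, seed: int) -> bool:
    """Stand-in for one sandboxed human-AI interaction.

    Returns True if a safety checker flags the episode as risky.
    """
    rng = random.Random(hash((scenario, seed)))
    return rng.random() < 0.5  # placeholder safety judgment

def safety_risk_rate(scenarios: list, runs_per_scenario: int) -> float:
    episodes = [(s, i) for s in scenarios for i in range(runs_per_scenario)]
    risky = sum(simulate(s, i) for s, i in episodes)
    return risky / len(episodes)

scenarios = [f"{d}-case-{k}" for d in DOMAINS for k in range(4)]
print("risk rate:", safety_risk_rate(scenarios, runs_per_scenario=20))
```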
arXiv Detail & Related papers (2024-09-24T19:47:21Z)
- Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions [68.92637077909693]
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in the graphical user interface (GUI) environment.
A general setting is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content.
Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions.
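A faithfulness check in the spirit of this setting could inject benign-but-irrelevant GUI elements and measure how often the agent still acts on the user's actual goal. The structures below are hypothetical, not the paper's benchmark.

```python
def inject_distractors(screen: list, distractors: list) -> list:
    """Add benign-but-unrelated elements (pop-ups, ads) to a GUI state."""
    return screen + distractors

def faithfulness(agent, tasks: list, distractors: list) -> float:
    """Fraction of tasks where the agent ignores the unrelated content.

    `agent.choose_element` is an assumed interface: it picks a GUI
    element given the screen contents and the user's instruction.
    """
    hits = 0
    for task in tasks:
        screen = inject_distractors(task["screen"], distractors)
        chosen = agent.choose_element(screen, task["instruction"])
        hits += chosen == task["gold_element"]  # acted on the user's goal?
    return hits / len(tasks)
```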
arXiv Detail & Related papers (2024-08-05T15:16:22Z)
- Position Paper: Agent AI Towards a Holistic Intelligence [53.35971598180146]
We emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions.
In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model.
arXiv Detail & Related papers (2024-02-28T16:09:56Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
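A rough sketch of how such a unified objective could combine the three pre-training terms follows; the module names and loss weights are assumptions, not the paper's actual design.

```python
import torch

def joint_loss(model, batch, w_mae=1.0, w_lm=1.0, w_act=1.0):
    """Weighted sum of the three pre-training objectives (illustrative)."""
    # Visual masked auto-encoding: reconstruct masked image patches.
    recon = model.decode_patches(batch["masked_frames"])
    l_mae = torch.nn.functional.mse_loss(recon, batch["target_patches"])

    # Language modeling: next-token prediction over the text stream.
    logits = model.language_head(batch["text_tokens"][:, :-1])
    l_lm = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), batch["text_tokens"][:, 1:].flatten())

    # Next-action prediction: classify the next embodied action.
    action_logits = model.action_head(batch["observations"])
    l_act = torch.nn.functional.cross_entropy(
        action_logits, batch["next_actions"])

    return w_mae * l_mae + w_lm * l_lm + w_act * l_act
```

Sharing one model across the three terms is what lets a single agent transfer across domains as different as Robotics, Gaming AI, and Healthcare.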
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- Signifiers as a First-class Abstraction in Hypermedia Multi-Agent Systems [0.6595290783361959]
We build on concepts and methods from Affordance Theory and Human-Computer Interaction to introduce signifiers as a first-class abstraction in Web-based Multi-Agent Systems.
We define a formal model for the contextual exposure of signifiers in hypermedia environments that aims to drive affordance exploitation.
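One way to picture a first-class signifier is as an affordance description paired with a context predicate that controls when the hypermedia environment exposes it to an agent. The field names below are assumptions, not the paper's formal model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Signifier:
    affordance: str                        # e.g., "toggle the lamp"
    action_spec: dict                      # how to exploit the affordance
    recommends_to: Callable[[dict], bool]  # contextual exposure condition

def exposed_signifiers(signifiers: list, agent_ctx: dict) -> list:
    """Return only the signifiers whose context condition holds."""
    return [s for s in signifiers if s.recommends_to(agent_ctx)]

lamp = Signifier(
    affordance="toggle the lamp",
    action_spec={"method": "POST", "target": "/lamp/state"},
    recommends_to=lambda ctx: ctx.get("ability") == "http-client",
)
print(exposed_signifiers([lamp], {"ability": "http-client"}))
```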
arXiv Detail & Related papers (2023-02-14T10:54:46Z)
- Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning [20.02604302565522]
A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language.
Here we study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment.
We show that imitation learning of human-human interactions in a simulated world, in conjunction with self-supervised learning, is sufficient to produce a multimodal interactive agent, which we call MIA, that successfully interacts with non-adversarial humans 75% of the time.
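At the core of this recipe is a behavioral-cloning step over human-human demonstrations: train the agent to reproduce the demonstrated action for each observation. The sketch below is a generic version of that step; MIA's actual architecture and training are more involved.

```python
import torch

def imitation_step(policy, optimizer, batch) -> float:
    """One behavioral-cloning update on (observation, action) pairs.

    `policy` is an assumed multimodal encoder with an action head.
    """
    logits = policy(batch["observations"])
    loss = torch.nn.functional.cross_entropy(logits, batch["human_actions"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```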
arXiv Detail & Related papers (2021-12-07T15:17:27Z)
- Imitating Interactive Intelligence [24.95842455898523]
We study how to design artificial agents that can interact naturally with humans using the simplification of a virtual environment.
To build agents that can robustly interact with humans, we would ideally train them while they interact with humans.
We use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour.
arXiv Detail & Related papers (2020-12-10T13:55:47Z)
- ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation [75.0278287071591]
ThreeDWorld (TDW) is a platform for interactive multi-modal physical simulation.
TDW enables simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments.
We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science.
arXiv Detail & Related papers (2020-07-09T17:33:27Z)
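Platforms like TDW expose simulated sensory streams that an agent queries and acts on tick by tick. The generic loop below shows the shape of that interaction; it is not TDW's actual API, and every class and method name here is a placeholder for illustration only.

```python
class SimulatedWorld:
    """Stand-in for a multi-modal physics simulator (not the TDW API)."""
    def __init__(self):
        self.t = 0

    def observe(self) -> dict:
        # Real platforms return rendered images, audio, and physics state.
        return {"rgb": None, "audio": None, "contacts": [], "t": self.t}

    def step(self, command: dict) -> None:
        self.t += 1  # advance the physics simulation one tick

world = SimulatedWorld()
for _ in range(10):
    obs = world.observe()                      # high-fidelity sensory data
    world.step({"move": "forward", "m": 0.1})  # agent-object interaction
```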