MOCA: A Modular Object-Centric Approach for Interactive Instruction Following
- URL: http://arxiv.org/abs/2012.03208v1
- Date: Sun, 6 Dec 2020 07:59:22 GMT
- Title: MOCA: A Modular Object-Centric Approach for Interactive Instruction Following
- Authors: Kunal Pratap Singh, Suvaansh Bhambri, Byeonghwi Kim, Roozbeh Mottaghi, Jonghyun Choi
- Abstract summary: We propose a modular architecture that decouples the task into visual perception and action policy.
We evaluate our method on the ALFRED benchmark and empirically validate that it outperforms prior art.
- Score: 19.57344182656879
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Performing simple household tasks based on language directives is very
natural to humans, yet it remains an open challenge for an AI agent. Recently,
an 'interactive instruction following' task has been proposed to foster
research in reasoning over long instruction sequences that require object
interactions in a simulated environment. It involves solving open problems in
the vision, language, and navigation literature at each step. To address this
multifaceted problem, we propose a modular architecture that decouples the task
into visual perception and action policy, which we name MOCA, a Modular
Object-Centric Approach. We evaluate our method on the ALFRED benchmark and
empirically validate that it outperforms prior art by significant margins on
all metrics, with good generalization performance (a high success rate in unseen
environments). Our code is available at https://github.com/gistvision/moca.
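To make the decoupling concrete, here is a minimal sketch of the two-stream idea from the abstract: one stream decides *what* action to take, the other decides *where* to interact by predicting an object mask. Module names, feature dimensions, and the action vocabulary are illustrative assumptions, not MOCA's actual implementation (see the linked repository for that).

```python
# Illustrative sketch of decoupling visual perception from action policy.
import torch
import torch.nn as nn

class ActionPolicyStream(nn.Module):
    """Decides WHAT to do: maps fused vision/language features to an action."""
    def __init__(self, feat_dim=512, num_actions=12):
        super().__init__()
        self.fuse = nn.Linear(feat_dim * 2, feat_dim)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, vis_feat, lang_feat):
        h = torch.relu(self.fuse(torch.cat([vis_feat, lang_feat], dim=-1)))
        return self.action_head(h)  # logits over low-level actions

class VisualPerceptionStream(nn.Module):
    """Decides WHERE to act: predicts a pixel mask for the target object."""
    def __init__(self, feat_dim=512, mask_hw=32):
        super().__init__()
        self.mask_head = nn.Linear(feat_dim * 2, mask_hw * mask_hw)
        self.mask_hw = mask_hw

    def forward(self, vis_feat, lang_feat):
        m = self.mask_head(torch.cat([vis_feat, lang_feat], dim=-1))
        return m.view(-1, self.mask_hw, self.mask_hw)  # interaction mask logits

# Both streams consume the same observation but are trained and decoded separately.
vis, lang = torch.randn(1, 512), torch.randn(1, 512)
action_logits = ActionPolicyStream()(vis, lang)
mask_logits = VisualPerceptionStream()(vis, lang)
print(action_logits.shape, mask_logits.shape)  # (1, 12) and (1, 32, 32)
```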
Related papers
- ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning.
We propose visual-temporal context, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-06-17T05:55:35Z)
- Enabling robots to follow abstract instructions and complete complex dynamic tasks [4.514939211420443]
We present a novel framework that combines Large Language Models, a curated Knowledge Base, and Integrated Force and Visual Feedback (IFVF).
Our approach interprets abstract instructions, performs long-horizon tasks, and handles various uncertainties.
Our findings are illustrated in an accompanying video and supported by an open-source GitHub repository.
arXiv Detail & Related papers (2024-03-13T17:50:32Z)
- Scaling Instructable Agents Across Many Simulated Worlds [70.97268311053328]
Our goal is to develop an agent that can accomplish anything a human can do in any simulated 3D environment.
Our approach focuses on language-driven generality while imposing minimal assumptions.
Our agents interact with environments in real-time using a generic, human-like interface.
arXiv Detail & Related papers (2024-03-13T09:12:16Z)
- NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-02-16T13:21:33Z)
- OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models [16.50443396055173]
We propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object navigation.
We first unleash the reasoning abilities of large language models to extract proposed objects from natural language instructions.
We then leverage the generalizability of large vision language models to actively discover and detect candidate objects from the scene.
arXiv Detail & Related papers (2024-02-16T13:21:33Z)
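A minimal sketch of the two-stage OpenFMNav-style pipeline summarized above: an LLM proposes object categories from the instruction, then a vision-language detector scores them in the current observation. Both model calls are stubbed out here; the function names and canned outputs are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch: LLM proposes objects, VLM detects candidates.
def propose_objects_with_llm(instruction: str) -> list[str]:
    # Stand-in for an LLM call that parses the instruction into object nouns,
    # e.g. "find something to drink from" -> ["cup", "mug", "glass"].
    canned = {"find something to drink from": ["cup", "mug", "glass"]}
    return canned.get(instruction, [])

def detect_candidates_with_vlm(frame, proposed: list[str]) -> dict[str, float]:
    # Stand-in for an open-vocabulary detector scoring each proposed label.
    return {label: 0.5 for label in proposed}  # dummy confidence scores

def navigate_step(instruction: str, frame) -> str | None:
    proposed = propose_objects_with_llm(instruction)
    detections = detect_candidates_with_vlm(frame, proposed)
    if not detections:
        return None  # nothing proposed or detected: keep exploring
    return max(detections, key=detections.get)  # head toward the best match

print(navigate_step("find something to drink from", frame=None))
```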
- Code-Switched Language Identification is Harder Than You Think [69.63439391717691]
Code switching is a common phenomenon in written and spoken communication.
We examine its application to building code-switched (CS) corpora.
We make the task more realistic by scaling it to more languages.
We reformulate the task as a sentence-level multi-label tagging problem to make it more tractable.
arXiv Detail & Related papers (2024-02-02T15:38:47Z)
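A minimal sketch of the sentence-level multi-label reformulation mentioned above: each sentence gets an independent yes/no label per language, trained with binary cross-entropy. The encoder, feature dimension, and label set are illustrative assumptions.

```python
# Illustrative sketch: code-switched LID as multi-label tagging.
import torch
import torch.nn as nn

LANGS = ["en", "es", "hi", "ar"]  # toy label set

class MultiLabelLangTagger(nn.Module):
    def __init__(self, feat_dim=256, num_langs=len(LANGS)):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_langs)  # one logit per language

    def forward(self, sent_feat):
        return self.classifier(sent_feat)

model = MultiLabelLangTagger()
sent_feat = torch.randn(2, 256)              # stand-in for encoded sentences
target = torch.tensor([[1., 1., 0., 0.],     # en-es code-switched sentence
                       [0., 0., 1., 0.]])    # monolingual hi sentence
loss = nn.BCEWithLogitsLoss()(model(sent_feat), target)
print(loss.item())
```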
- Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL).
This paper presents a general framework for integrating and learning structured reasoning in AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
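A minimal sketch of the objective named in the title above: a student network sees a masked view and is trained to predict a teacher's soft codebook assignments for the masked patches. All components here (random features, toy codebook, KL objective) are illustrative stand-ins, not the paper's actual architecture.

```python
# Illustrative sketch: predict masked codebook assignments from a teacher.
import torch
import torch.nn.functional as F

num_patches, dim, codebook_size = 16, 64, 128
codebook = torch.randn(codebook_size, dim)            # online codebook (toy)
patch_feats = torch.randn(num_patches, dim)           # teacher patch features
student_feats = torch.randn(num_patches, dim, requires_grad=True)
mask = torch.rand(num_patches) < 0.5                  # which patches are masked

# Teacher: soft assignment of each patch to codebook entries (no gradient).
with torch.no_grad():
    target = F.softmax(patch_feats @ codebook.T, dim=-1)

# Student: predict those assignments for the masked patches only.
pred = F.log_softmax(student_feats @ codebook.T, dim=-1)
loss = F.kl_div(pred[mask], target[mask], reduction="batchmean")
loss.backward()
print(loss.item())
```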
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z)
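A minimal sketch of the idea summarized above: an instruction is mapped to a sequence of primitive skill calls that a manipulator can execute. The LLM call is stubbed out and the primitive API is an illustrative assumption, not Instruct2Act's actual interface.

```python
# Illustrative sketch: instruction -> sequence of primitive skill calls.
def llm_plan(instruction: str) -> list[tuple[str, str]]:
    # Stand-in for an LLM that emits (skill, argument) pairs.
    canned = {
        "put the red block in the bowl": [
            ("locate", "red block"), ("pick", "red block"),
            ("locate", "bowl"), ("place", "bowl"),
        ]
    }
    return canned.get(instruction, [])

PRIMITIVES = {
    "locate": lambda arg: print(f"locating {arg}"),
    "pick":   lambda arg: print(f"picking {arg}"),
    "place":  lambda arg: print(f"placing on {arg}"),
}

for skill, arg in llm_plan("put the red block in the bowl"):
    PRIMITIVES[skill](arg)  # execute the plan step by step
```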
- Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
arXiv Detail & Related papers (2021-01-09T21:49:41Z)
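A minimal sketch of the auxiliary objective described in the entry above: features from several viewing angles are pooled and the agent predicts its spatial relation to the target at each timestep. The shapes, view count, and relation vocabulary are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: multi-view features -> relative location prediction.
import torch
import torch.nn as nn

NUM_VIEWS, FEAT_DIM = 4, 256                      # e.g. left/right/up/straight
RELATIONS = ["left", "right", "ahead", "behind"]  # toy relation classes

class RelativeLocalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.Linear(NUM_VIEWS * FEAT_DIM, FEAT_DIM)
        self.rel_head = nn.Linear(FEAT_DIM, len(RELATIONS))

    def forward(self, view_feats):                # (B, NUM_VIEWS, FEAT_DIM)
        h = torch.relu(self.pool(view_feats.flatten(1)))
        return self.rel_head(h)                   # logits over relations

model = RelativeLocalizer()
views = torch.randn(1, NUM_VIEWS, FEAT_DIM)  # stand-in for multi-angle frames
print(RELATIONS[model(views).argmax(-1).item()])
```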
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.