Look Wide and Interpret Twice: Improving Performance on Interactive
Instruction-following Tasks
- URL: http://arxiv.org/abs/2106.00596v1
- Date: Tue, 1 Jun 2021 16:06:09 GMT
- Title: Look Wide and Interpret Twice: Improving Performance on Interactive
Instruction-following Tasks
- Authors: Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani
- Abstract summary: Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task.
This paper proposes a new method, which outperforms the previous methods by a large margin.
- Score: 29.671268927569063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a growing interest in the community in making an embodied AI agent
perform a complicated task while interacting with an environment following
natural language directives. Recent studies have tackled the problem using
ALFRED, a well-designed dataset for the task, but achieved only very low
accuracy. This paper proposes a new method, which outperforms the previous
methods by a large margin. It is based on a combination of several new ideas.
One is a two-stage interpretation of the provided instructions. The method
first selects and interprets an instruction without using visual information,
yielding a tentative action sequence prediction. It then integrates the
prediction with visual and other information, yielding the final prediction of
an action and an object. Since the class of the object to interact with is
identified in the first stage, the method can accurately select the correct
object from the input image.
Moreover, our method considers multiple egocentric views of the environment and
extracts essential information by applying hierarchical attention conditioned
on the current instruction. This contributes to the accurate prediction of
actions for navigation. A preliminary version of the method won the ALFRED
Challenge 2020. The current version achieves a success rate of 4.45% in unseen
environments with a single view, which is further improved to 8.37% with
multiple views.
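
The two-stage interpretation and the multi-view hierarchical attention described in the abstract can be illustrated with a short sketch. All module names, dimensions, and the fusion scheme below are assumptions made for illustration and do not reproduce the authors' architecture.

```python
# Minimal PyTorch sketch: a language-only first stage yields a tentative
# action / object-class prediction; a second stage fuses it with hierarchically
# attended features from several egocentric views. Hypothetical, not the
# authors' code.
import torch
import torch.nn as nn


class TwoStageInterpreter(nn.Module):
    def __init__(self, vocab_size, n_actions, n_classes, d=256):
        super().__init__()
        # Stage 1: interpret the selected instruction without visual input.
        self.embed = nn.Embedding(vocab_size, d)
        self.lang_enc = nn.GRU(d, d, batch_first=True)
        self.tentative_action = nn.Linear(d, n_actions)
        self.object_class = nn.Linear(d, n_classes)
        # Stage 2: hierarchical attention over views, conditioned on language.
        self.region_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.view_score = nn.Linear(d, 1)
        self.tent_embed = nn.Linear(n_actions, d)
        self.fuse = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU())
        self.final_action = nn.Linear(d, n_actions)

    def forward(self, instr_tokens, view_feats):
        # instr_tokens: (B, T) token ids of the currently selected instruction
        # view_feats:   (B, V, R, d) region features from V egocentric views
        _, h = self.lang_enc(self.embed(instr_tokens))
        q = h[-1].unsqueeze(1)                                # (B, 1, d)

        # Stage 1: tentative prediction from language alone.
        tentative = self.tentative_action(q.squeeze(1))       # (B, n_actions)
        obj_class = self.object_class(q.squeeze(1))           # (B, n_classes)

        # Stage 2a: attend over regions within each view.
        B, V, R, d = view_feats.shape
        regions = view_feats.reshape(B * V, R, d)
        q_rep = q.repeat_interleave(V, dim=0)                 # (B*V, 1, d)
        per_view, _ = self.region_attn(q_rep, regions, regions)
        per_view = per_view.reshape(B, V, d)

        # Stage 2b: attend over the views themselves.
        w = torch.softmax(self.view_score(per_view), dim=1)   # (B, V, 1)
        visual = (w * per_view).sum(dim=1)                    # (B, d)

        # Fuse language, the tentative prediction (as a soft prior), and vision.
        prior = self.tent_embed(torch.softmax(tentative, dim=-1))
        fused = self.fuse(torch.cat([q.squeeze(1), prior, visual], dim=-1))
        return tentative, obj_class, self.final_action(fused)


model = TwoStageInterpreter(vocab_size=1000, n_actions=12, n_classes=80)
tokens = torch.randint(0, 1000, (2, 20))                      # 2 instructions
views = torch.randn(2, 5, 49, 256)                            # 5 views, 49 regions
tentative, obj_class, final = model(tokens, views)            # final: (2, 12)
```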
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the features from all the clips in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation [14.188006024550257]
We study the short-term object interaction anticipation problem from the egocentric point of view.
Our approach simultaneously processes a still image and a video to detect and localize next-active objects.
Our method is ranked first on the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022.
arXiv Detail & Related papers (2023-04-08T09:01:37Z)
- Fine-Grained Visual Entailment [51.66881737644983]
We propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity.
We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task.
arXiv Detail & Related papers (2022-03-29T16:09:38Z)
- Explain and Predict, and then Predict Again [6.865156063241553]
We propose ExPred, which uses multi-task learning in the explanation-generation phase, effectively trading off explanation and prediction losses.
We conduct an extensive evaluation of our approach on three diverse language datasets.
arXiv Detail & Related papers (2021-01-11T19:36:52Z)
- Learning Long-term Visual Dynamics with Region Proposal Interaction Networks [75.06423516419862]
We build object representations that can capture inter-object and object-environment interactions over a long range.
Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin.
arXiv Detail & Related papers (2020-08-05T17:48:00Z)
- Object-and-Action Aware Model for Visual Language Navigation [70.33142095637515]
Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions.
We propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural-language instruction separately.
This enables each process to flexibly match object-centered/action-centered instructions to their counterpart visual perception/action orientation.
arXiv Detail & Related papers (2020-07-29T06:32:18Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.