Localizing Active Objects from Egocentric Vision with Symbolic World
Knowledge
- URL: http://arxiv.org/abs/2310.15066v1
- Date: Mon, 23 Oct 2023 16:14:05 GMT
- Title: Localizing Active Objects from Egocentric Vision with Symbolic World
Knowledge
- Authors: Te-Lin Wu, Yu Zhou, Nanyun Peng
- Abstract summary: The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
- Score: 62.981429762309226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to actively ground task instructions from an egocentric view is
crucial for AI agents to accomplish tasks or assist humans virtually. One
important step towards this goal is to localize and track key active objects
that undergo major state change as a consequence of human actions/interactions
with the environment, without being told exactly what/where to ground (e.g.,
localizing and tracking the `sponge` in video from the instruction "Dip the
`sponge` into the bucket."). While existing works approach this problem from a
pure vision perspective, we investigate to what extent the textual modality
(i.e., task instructions) and its interaction with the visual modality can be
beneficial. Specifically, we propose to improve phrase grounding models'
ability to localize the active objects by: (1) learning the role of `objects
undergoing change` and extracting them accurately from the instructions, (2)
leveraging pre- and post-conditions of the objects during actions, and (3)
recognizing the objects more robustly with descriptive knowledge. We leverage
large language models (LLMs) to extract the aforementioned action-object
knowledge, and design a per-object aggregation masking technique to effectively
perform joint inference on object phrases and symbolic knowledge. We evaluate
our framework on Ego4D and Epic-Kitchens datasets. Extensive experiments
demonstrate the effectiveness of our proposed framework, which leads to >54%
improvements in all standard metrics on the TREK-150-OPE-Det localization +
tracking task, >7% improvements in all standard metrics on the TREK-150-OPE
tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD
task.
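
The abstract names two mechanisms without giving details: extracting action-object knowledge with an LLM, and a per-object aggregation masking step for joint inference over object phrases and symbolic knowledge. Below is a minimal, hypothetical Python sketch of one plausible reading of these ideas. The prompt wording, function names, and tensor layout are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch only: an assumed LLM prompt for symbolic action-object
# knowledge, and a per-object mask that pools token-to-box grounding scores.
import torch

# (a) Assumed prompt for eliciting the object undergoing change, its
#     pre-/post-conditions, and a visual description from an LLM.
KNOWLEDGE_PROMPT = (
    "Instruction: {instruction}\n"
    "1) Which object undergoes a state change?\n"
    "2) What are its pre-condition and post-condition?\n"
    "3) Give a short visual description of that object."
)

def build_object_masks(token_spans, num_tokens):
    """token_spans maps each object phrase to its token spans in the input text,
    including tokens of any appended symbolic knowledge (conditions, descriptions).
    Returns a (num_objects, num_tokens) binary mask."""
    masks = torch.zeros(len(token_spans), num_tokens)
    for i, spans in enumerate(token_spans.values()):
        for start, end in spans:
            masks[i, start:end] = 1.0
    return masks

def aggregate_object_scores(token_box_scores, object_masks):
    """token_box_scores: (num_tokens, num_boxes) token-to-box similarities from a
    phrase grounding model. Averages them over each object's masked tokens to
    produce one score per (object, box) pair."""
    weights = object_masks / object_masks.sum(dim=1, keepdim=True).clamp(min=1.0)
    return weights @ token_box_scores  # (num_objects, num_boxes)
```

Given these aggregated scores, the highest-scoring box in the row for the object undergoing change would be the predicted active object; this is one possible instantiation of the described per-object aggregation masking, not a claim about the paper's exact method.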
Related papers
- Structured Spatial Reasoning with Open Vocabulary Object Detectors [2.089191490381739]
Reasoning about spatial relationships between objects is essential for many real-world robotic tasks.
We introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors.
The approach is evaluated and compared against zero-shot performance of the state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks.
arXiv Detail & Related papers (2024-10-09T19:37:01Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - Visual Grounding for Object-Level Generalization in Reinforcement Learning [35.39214541324909]
Generalization is a pivotal challenge for agents following natural language instructions.
We leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning.
We show that our intrinsic reward significantly improves performance on challenging skill learning.
arXiv Detail & Related papers (2024-08-04T06:34:24Z) - TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z) - KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following natural language instructions in real scenes.
Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z) - Learning Action-Effect Dynamics for Hypothetical Vision-Language
Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z) - Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search for target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
arXiv Detail & Related papers (2022-03-15T17:59:01Z) - Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a
First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, simulated 3D environments such as AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents.
We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
arXiv Detail & Related papers (2020-10-28T19:27:26Z)