Localizing Active Objects from Egocentric Vision with Symbolic World
Knowledge
- URL: http://arxiv.org/abs/2310.15066v1
- Date: Mon, 23 Oct 2023 16:14:05 GMT
- Title: Localizing Active Objects from Egocentric Vision with Symbolic World
Knowledge
- Authors: Te-Lin Wu, Yu Zhou, Nanyun Peng
- Abstract summary: The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability on localizing the active objects by: learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
- Score: 62.981429762309226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to actively ground task instructions from an egocentric view is
crucial for AI agents to accomplish tasks or assist humans virtually. One
important step towards this goal is to localize and track key active objects
that undergo major state change as a consequence of human actions/interactions
with the environment without being told exactly what/where to ground (e.g.,
localizing and tracking the `sponge` in video from the instruction "Dip the
`sponge` into the bucket."). While existing works approach this problem from a
pure vision perspective, we investigate to what extent the textual modality
(i.e., task instructions) and its interaction with the visual modality can be
beneficial. Specifically, we propose to improve phrase grounding models'
ability on localizing the active objects by: (1) learning the role of `objects
undergoing change` and extracting them accurately from the instructions, (2)
leveraging pre- and post-conditions of the objects during actions, and (3)
recognizing the objects more robustly with descriptional knowledge. We leverage
large language models (LLMs) to extract the aforementioned action-object
knowledge, and design a per-object aggregation masking technique to effectively
perform joint inference on object phrases and symbolic knowledge. We evaluate
our framework on Ego4D and Epic-Kitchens datasets. Extensive experiments
demonstrate the effectiveness of our proposed framework, which leads to >54%
improvements in all standard metrics on the TREK-150-OPE-Det localization +
tracking task, >7% improvements in all standard metrics on the TREK-150-OPE
tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD
task.
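The abstract mentions a per-object aggregation masking technique for joint inference over object phrases and symbolic knowledge, but gives no implementation details. The following is only a minimal sketch of one plausible reading, not the authors' method: it assumes a grounding model has already produced a token-to-box score matrix, and that each object has a binary mask selecting the tokens that refer to it (its name plus any pre-/post-condition or description words). The function names, the masked-mean rule, and the toy scores are all hypothetical.

```python
from typing import Dict, List


def aggregate_per_object(
    token_box_scores: List[List[float]],
    object_token_masks: Dict[str, List[int]],
) -> Dict[str, List[float]]:
    """Hypothetical masked mean: for each object, average the box scores
    of only those tokens whose mask entry is 1."""
    num_boxes = len(token_box_scores[0])
    aggregated: Dict[str, List[float]] = {}
    for name, mask in object_token_masks.items():
        # Keep only the score rows for tokens that belong to this object.
        selected = [row for row, keep in zip(token_box_scores, mask) if keep]
        if not selected:
            aggregated[name] = [0.0] * num_boxes
            continue
        aggregated[name] = [
            sum(row[b] for row in selected) / len(selected)
            for b in range(num_boxes)
        ]
    return aggregated


def best_box(scores: List[float]) -> int:
    """Index of the highest-scoring candidate box."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example (made-up numbers): tokens are "sponge", "dry" (assumed
# pre-condition), "wet" (assumed post-condition), and "bucket";
# columns are three candidate boxes.
scores = [
    [0.2, 0.8, 0.1],  # sponge
    [0.1, 0.6, 0.3],  # dry
    [0.3, 0.7, 0.2],  # wet
    [0.9, 0.1, 0.2],  # bucket
]
masks = {"sponge": [1, 1, 1, 0], "bucket": [0, 0, 0, 1]}
agg = aggregate_per_object(scores, masks)
print({name: best_box(s) for name, s in agg.items()})
```

In this toy setup the condition tokens contribute to the `sponge` score alongside the object name, while the mask keeps the `bucket` token from influencing it.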
Related papers
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object
Detection [21.11998015053674]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the knowledge is hidden and beyond the image content with the supervised labels from a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following natural language instructions.
Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
- Object-Centric Scene Representations using Active Inference [4.298360054690217]
Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment.
We propose a novel approach for scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer object category.
For evaluating the behavior of an active vision agent, we also propose a new benchmark where, given a target viewpoint of a particular object, the agent needs to find the best matching viewpoint.
arXiv Detail & Related papers (2023-02-07T06:45:19Z)
- Learning Action-Effect Dynamics for Hypothetical Vision-Language
Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search of target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
arXiv Detail & Related papers (2022-03-15T17:59:01Z)
- Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a
First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, simulated 3D environments such as AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents.
We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
arXiv Detail & Related papers (2020-10-28T19:27:26Z)