ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition
- URL: http://arxiv.org/abs/2406.05722v1
- Date: Sun, 9 Jun 2024 10:30:04 GMT
- Title: ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition
- Authors: Sanjoy Kundu, Shubham Trehan, Sathyanarayanan N. Aakur
- Abstract summary: We propose a neuro-symbolic framework called ALGO - Action Learning with Grounded Object recognition.
First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video.
Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework.
- Score: 6.253919624802853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy. Foundation models pre-trained on enormous amounts of data have shown remarkable generalization through prompting, particularly in zero-shot inference. However, their performance is limited by the correctness of the target label search space. In an open world, this search space can be unknown or exceptionally large, which severely restricts the performance of such models. To tackle this challenging problem, we propose a neuro-symbolic framework called ALGO (Action Learning with Grounded Object recognition) that uses symbolic knowledge stored in large-scale knowledge bases to infer activities in egocentric videos with limited supervision, in two steps. First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video through evidence-based reasoning. Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework and learn to ground knowledge-based action (verb) concepts in the video. Extensive experiments on publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus) demonstrate its performance on open-world activity inference.
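To make the two-step idea concrete, here is a minimal, self-contained Python sketch of the pipeline shape the abstract describes. It is not the authors' implementation: `vlm_object_evidence`, the toy `KB_VERB_AFFINITY` table, and the additive energy are hypothetical stand-ins for the vision-language oracle, the commonsense knowledge base, and the pattern-theory energy.

```python
import math

# Hypothetical stand-ins, not the authors' code: the VLM oracle and the
# knowledge base are replaced by toy lookups.
def vlm_object_evidence(frame_evidence, obj):
    """Noisy-oracle confidence that `obj` is visible in the clip."""
    return frame_evidence.get(obj, 0.0)

KB_VERB_AFFINITY = {  # toy commonsense prior over verb-object pairs
    ("cut", "onion"): 0.9, ("pour", "onion"): 0.1,
    ("cut", "milk"): 0.05, ("pour", "milk"): 0.85,
}

def energy(verb, obj, frame_evidence, w_ground=1.0, w_prior=1.0):
    """Lower energy = more plausible activity. Two additive terms:
    grounding evidence from the oracle and the knowledge-base prior."""
    grounding = vlm_object_evidence(frame_evidence, obj)
    prior = KB_VERB_AFFINITY.get((verb, obj), 0.01)
    return -(w_ground * math.log(grounding + 1e-8)
             + w_prior * math.log(prior + 1e-8))

# Rank candidate activities for one toy clip.
frame_evidence = {"onion": 0.8, "milk": 0.1}  # oracle confidences
candidates = [(v, o) for v in ("cut", "pour") for o in ("onion", "milk")]
for verb, obj in sorted(candidates,
                        key=lambda vo: energy(vo[0], vo[1], frame_evidence)):
    print(f"{verb} {obj}: energy = {energy(verb, obj, frame_evidence):.2f}")
```

With this toy data, "cut onion" receives the lowest energy because both the grounding evidence and the commonsense prior support it, which mirrors how the two knowledge sources are meant to reinforce each other.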
Related papers
- Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes [23.284478293459856]
Action-slot is a slot attention-based approach that learns visual action-centric representations.
Our key idea is to design action slots that are capable of paying attention to regions where atomic activities occur.
To address the limitation, we collect a synthetic dataset called TACO, which is four times larger than OATS.
arXiv Detail & Related papers (2023-11-29T05:28:05Z)
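A toy NumPy illustration of the slot-attention mechanism underlying the Action-slot entry above, heavily simplified (no learned projections or GRU update); every name here is illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def slot_attention_step(slots, inputs):
    """One simplified slot-attention iteration: the softmax over slots
    makes slots compete for input regions, then each slot is updated
    with the weighted mean of the inputs it claims."""
    d = slots.shape[1]
    logits = inputs @ slots.T / np.sqrt(d)      # (n_inputs, n_slots)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)     # normalize over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ inputs                   # (n_slots, d)

inputs = rng.normal(size=(16, 8))  # e.g. 16 region features of dim 8
slots = rng.normal(size=(4, 8))    # 4 "action slots"
for _ in range(3):
    slots = slot_attention_step(slots, inputs)
print(slots.shape)                 # (4, 8)
```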
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising because it can find objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
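The PCA-based localization step in the entry above admits a small sketch: project spatial features onto their first principal component and threshold. This only shows the core idea; the paper's exact procedure differs, and the feature map below is a random stand-in.

```python
import numpy as np

def pca_foreground_map(feat):
    """Project spatial features (H, W, C) onto their first principal
    component; high-magnitude responses often align with the main object."""
    h, w, c = feat.shape
    x = feat.reshape(-1, c)
    x = x - x.mean(axis=0)
    # First right-singular vector = first principal direction.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    pc1 = x @ vt[0]
    return pc1.reshape(h, w)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(14, 14, 64))  # stand-in for ViT patch features
mask = pca_foreground_map(fmap) > 0   # crude foreground/background split
print(mask.shape, mask.mean())
```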
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
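The region-to-semantics matching in the entry above reduces, at its core, to a similarity lookup. In this hedged sketch, random class embeddings stand in for the knowledge-graph-propagated semantic vectors the paper actually uses.

```python
import numpy as np

def match_regions_to_classes(region_feats, class_embs):
    """Cosine similarity between image regions and class semantic
    embeddings; an unseen class can be recognized when its embedding
    matches some region strongly."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sim = r @ c.T                  # (n_regions, n_classes)
    return sim.max(axis=0)         # best region score per class

rng = np.random.default_rng(0)
scores = match_regions_to_classes(rng.normal(size=(5, 32)),
                                  rng.normal(size=(10, 32)))
print(scores.argmax())             # predicted (possibly unseen) class
```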
- Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning [6.253919624802853]
We propose a two-step, neuro-symbolic framework called ALGO to infer activities in egocentric videos with limited supervision.
First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video.
Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework.
arXiv Detail & Related papers (2023-05-26T03:21:30Z)
- Open Long-Tailed Recognition in a Dynamic World [82.91025831618545]
Real world data often exhibits a long-tailed and open-ended (with unseen classes) distribution.
A practical recognition system must balance the majority (head) and minority (tail) classes, generalize across the distribution, and flag novelty in instances of unseen (open) classes.
We define Open Long-Tailed Recognition++ as learning from such naturally distributed data and optimizing for the classification accuracy over a balanced test set.
arXiv Detail & Related papers (2022-08-17T15:22:20Z)
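Since OLTR++ scores models by accuracy over a balanced test set, a toy evaluation helper clarifies why tail and open classes count as much as head classes. The data below is invented for illustration.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, classes):
    """Mean per-class accuracy: head, tail, and 'open' (unseen) classes
    contribute equally, regardless of test-set frequency."""
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

# Toy test set: class 2 stands for "open"/unseen instances.
y_true = np.array([0] * 90 + [1] * 9 + [2] * 1)
y_pred = np.array([0] * 90 + [0] * 9 + [2] * 1)  # tail class 1 missed
print(balanced_accuracy(y_true, y_pred, classes=[0, 1, 2]))  # ~0.667
```

Plain accuracy on this toy set would be 91%, hiding the total failure on the tail class; the balanced score exposes it.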
- HAKE: A Knowledge Engine Foundation for Human Activity Understanding [65.24064718649046]
Human activity understanding is of widespread interest in artificial intelligence and spans diverse applications like health care and behavior analysis.
We propose a novel paradigm to reformulate this task in two stages: first mapping pixels to an intermediate space spanned by atomic activity primitives, then programming detected primitives with interpretable logic rules to infer semantics.
Our framework, the Human Activity Knowledge Engine (HAKE), exhibits superior generalization ability and performance on challenging benchmarks.
arXiv Detail & Related papers (2022-02-14T16:38:31Z)
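The two-stage paradigm in the HAKE entry above (detect primitives, then apply logic rules) can be mimicked in a few lines. The primitive names and rules below are invented for illustration and are not taken from HAKE's actual primitive vocabulary.

```python
# Hypothetical two-stage inference in the spirit of HAKE.
detected = {"hand_hold_something", "head_look_at_something"}

RULES = {  # activity -> primitives that must all be present
    "read_book": {"hand_hold_something", "head_look_at_something"},
    "drink": {"hand_hold_something", "mouth_drink_with_something"},
}

for activity, required in RULES.items():
    if required <= detected:          # subset test = rule firing
        print("inferred:", activity)  # -> inferred: read_book
```

The appeal of this split is interpretability: a wrong prediction can be traced to either a missed primitive or a bad rule.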
- Opening up Open-World Tracking [62.12659607088812]
We propose and study Open-World Tracking (OWT).
This paper formalizes the OWT task and introduces an evaluation protocol and metric (OWTA).
We show that our Open-World Tracking Baseline, while performing well in the OWT setting, also achieves near state-of-the-art results on traditional closed-world benchmarks.
arXiv Detail & Related papers (2021-04-22T17:58:15Z)
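As a rough illustration of the OWTA-style evaluation mentioned above, the sketch below combines detection recall and association accuracy with a geometric mean. This is an assumption-laden simplification; consult the paper's HOTA-style definition for the real metric.

```python
import math

def owta_like(det_recall, assoc_accuracy):
    """Geometric mean of detection recall and association accuracy,
    in the spirit of OWTA (simplified; not the paper's exact formula)."""
    return math.sqrt(det_recall * assoc_accuracy)

print(round(owta_like(0.8, 0.6), 3))  # 0.693
```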
- Towards Open World Object Detection [68.79678648726416]
ORE, the Open World Object Detector, is based on contrastive clustering and energy-based unknown identification.
We find that identifying and characterizing unknown instances helps to reduce confusion in an incremental object detection setting.
arXiv Detail & Related papers (2021-03-03T18:58:18Z)
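Energy-based unknown identification, as in the ORE entry above, is commonly built on a free-energy score over classifier logits. The sketch below shows that generic idea, not ORE's exact formulation; the threshold and logits are invented.

```python
import numpy as np

def energy_score(logits):
    """Free-energy score E(x) = -logsumexp(logits); inputs from known
    classes tend to get lower energy than unknowns."""
    m = logits.max()
    return -(m + np.log(np.exp(logits - m).sum()))

def classify_open_world(logits, threshold):
    if energy_score(logits) > threshold:  # high energy -> unfamiliar
        return "unknown"
    return int(np.argmax(logits))

print(classify_open_world(np.array([8.0, 1.0, 0.5]), threshold=-5.0))  # 0
print(classify_open_world(np.array([0.4, 0.5, 0.3]), threshold=-5.0))  # unknown
```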
- Knowledge Guided Learning: Towards Open Domain Egocentric Action Recognition with Zero Supervision [5.28539620288341]
We show that attention and commonsense knowledge can be used to enable the self-supervised discovery of novel actions in egocentric videos.
We show that our approach can infer and learn novel classes for open vocabulary classification in egocentric videos.
arXiv Detail & Related papers (2020-09-16T04:44:51Z)
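Finally, a toy sketch of knowledge-guided discovery in the spirit of the last entry: objects detected in an egocentric clip retrieve associated verbs from a commonsense source, and the verb-object candidates are ranked. The table and scores below are invented, standing in for a ConceptNet-style knowledge base.

```python
# Invented commonsense triples: object -> plausible (verb, weight) pairs.
COMMONSENSE = {
    "knife": [("cut", 0.9), ("wash", 0.4)],
    "cup":   [("pour", 0.8), ("wash", 0.6), ("drink", 0.7)],
}

def candidate_actions(detected_objects):
    """Rank verb-object pairs by their commonsense weight."""
    pairs = [(verb, obj, score)
             for obj in detected_objects
             for verb, score in COMMONSENSE.get(obj, [])]
    return sorted(pairs, key=lambda p: -p[2])

print(candidate_actions(["knife", "cup"])[0])  # ('cut', 'knife', 0.9)
```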