Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
- URL: http://arxiv.org/abs/2305.16602v2
- Date: Fri, 3 May 2024 14:01:22 GMT
- Title: Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
- Authors: Sanjoy Kundu, Shubham Trehan, Sathyanarayanan N. Aakur
- Abstract summary: We propose a two-step, neuro-symbolic framework called ALGO to infer activities in egocentric videos with limited supervision.
First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video.
Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework.
- Score: 6.253919624802853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to infer labels in an open world, i.e., in an environment where the target "labels" are unknown, is an important characteristic for achieving autonomy. Foundation models, pre-trained on enormous amounts of data, have shown remarkable generalization skills through prompting, particularly in zero-shot inference. However, their performance is restricted to the correctness of the target label's search space, i.e., candidate labels provided in the prompt. This target search space can be unknown or exceptionally large in an open world, severely restricting their performance. To tackle this challenging problem, we propose a two-step, neuro-symbolic framework called ALGO - Action Learning with Grounded Object recognition that uses symbolic knowledge stored in large-scale knowledge bases to infer activities in egocentric videos with limited supervision. First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video through evidence-based reasoning. Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework and learn to ground knowledge-based action (verb) concepts in the video. Extensive experiments on four publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego) demonstrate its performance on open-world activity inference. We also show that ALGO can be extended to zero-shot inference and demonstrate its competitive performance on the Charades-Ego dataset.
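As a concrete illustration of the two steps, here is a minimal sketch, assuming CLIP (via open_clip) as the object-centric vision-language model and a toy co-occurrence table standing in for the knowledge base; the names ground_objects, rank_activities, and KB_PRIOR are illustrative, not from the paper.

```python
# Minimal sketch of the two-step ALGO idea (illustrative, not the authors' code).
# Step 1: an object-centric VLM (here CLIP) acts as a noisy oracle that scores
# candidate nouns per frame; evidence is aggregated across frames.
# Step 2: an energy function combines that visual evidence with a commonsense
# prior over verb-noun pairs and ranks plausible activities.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

NOUNS = ["knife", "cup", "pan", "bread"]
VERBS = ["cut", "pour", "wash", "take"]
# Toy stand-in for a knowledge-base prior over verb-noun pairs; the actual
# framework reasons over large-scale knowledge bases with pattern theory.
KB_PRIOR = {("cut", "knife"): 0.9, ("cut", "bread"): 0.8,
            ("pour", "cup"): 0.7, ("wash", "pan"): 0.6}

@torch.no_grad()
def ground_objects(frames):  # frames: list of PIL images
    """Step 1: per-frame noun likelihoods from the VLM, averaged as evidence."""
    text = tokenizer([f"a photo of a {n}" for n in NOUNS])
    t = model.encode_text(text)
    t = t / t.norm(dim=-1, keepdim=True)
    probs = []
    for frame in frames:
        v = model.encode_image(preprocess(frame).unsqueeze(0))
        v = v / v.norm(dim=-1, keepdim=True)
        probs.append((100.0 * v @ t.T).softmax(dim=-1).squeeze(0))
    return torch.stack(probs).mean(dim=0)  # evidence per noun

def rank_activities(noun_evidence):
    """Step 2: lower energy = more plausible verb-noun activity."""
    scored = []
    for verb in VERBS:
        for ni, noun in enumerate(NOUNS):
            prior = KB_PRIOR.get((verb, noun), 1e-3)
            energy = -torch.log(noun_evidence[ni]) \
                     - torch.log(torch.tensor(prior))
            scored.append((energy.item(), f"{verb} {noun}"))
    return sorted(scored)
```

Note that no activity labels are required in this sketch: supervision comes only from the VLM's noisy object evidence and the symbolic prior, which is what lets the verb-noun search space stay open.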
Related papers
- ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition [6.253919624802853]
We propose a neuro-symbolic framework called ALGO - Action Learning with Grounded Object recognition.
First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video.
Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework.
arXiv Detail & Related papers (2024-06-09T10:30:04Z)
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, finally followed by manual reviews for quality assurance.
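A rough sketch of what one such benchmark item might look like; the field names below are assumptions for illustration, not the released SOK-Bench format.

```python
# Hypothetical schema for one SOK-Bench-style item: a situated video question
# that must be answered by combining situated and general knowledge.
from dataclasses import dataclass

@dataclass
class SOKItem:
    video_id: str
    question: str
    options: list[str]
    answer_idx: int
    situated_facts: list[str]     # instance-level annotations from the video
    general_knowledge: list[str]  # open-world commonsense needed to answer
    rationale: str                # reasoning process, manually reviewed

def accuracy(items, predict):
    """predict(item) -> option index; plain multiple-choice accuracy."""
    return sum(predict(it) == it.answer_idx for it in items) / len(items)
```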
arXiv Detail & Related papers (2024-05-15T21:55:31Z)
- Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
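The general recipe behind feature-aligned pre-training can be sketched as a cosine distillation loss against a frozen vision-language teacher at each level of a feature hierarchy; this is an assumed simplification, not the paper's exact objective.

```python
# Sketch of hierarchical feature-aligned distillation: 3D point features are
# pulled toward frozen vision-language features level by level, so novel
# categories can later be queried by text embeddings.
import torch.nn.functional as F

def align_loss(feats_3d, feats_vl):
    """feats_3d, feats_vl: lists of (N_l, D) tensors, one per hierarchy level.
    The VL features (e.g., from CLIP) act as a frozen teacher."""
    loss = 0.0
    for f3d, fvl in zip(feats_3d, feats_vl):
        f3d = F.normalize(f3d, dim=-1)
        fvl = F.normalize(fvl.detach(), dim=-1)  # teacher is not updated
        loss = loss + (1.0 - (f3d * fvl).sum(dim=-1)).mean()
    return loss / len(feats_3d)
```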
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes [23.284478293459856]
Action-slot is a slot attention-based approach that learns visual action-centric representations.
Our key idea is to design action slots that are capable of paying attention to regions where atomic activities occur.
To address the limitation, we collect a synthetic dataset called TACO, which is four times larger than OATS.
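Action-slot builds on slot attention (Locatello et al., 2020); a minimal version of that mechanism is sketched below, with a learned initialization in place of the original sampled one and illustrative dimensions.

```python
# Minimal slot-attention module: slots compete for input regions via a softmax
# over the slot axis and are refined iteratively with a GRU update.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                  # x: (B, N, dim) region features
        B, D = x.shape[0], x.shape[-1]
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # softmax over slots: slots compete for each input location
            attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean per slot
            updates = attn @ v                            # (B, num_slots, dim)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, self.num_slots, -1)
        return slots  # each slot can be read out as one atomic activity
```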
arXiv Detail & Related papers (2023-11-29T05:28:05Z)
- Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models [16.08214739525615]
We present a new task called Local Scene Graph Generation.
It aims to abstract pertinent structural information with partial objects and their relationships in an image.
We introduce zEro-shot Local scEne GrAph geNeraTion (ELEGANT), a framework harnessing foundation models renowned for their powerful perception and commonsense reasoning.
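One plausible reading of the zero-shot setup is to let a foundation model score predicate prompts over a cropped object pair; the prompt templates and predicate list below are assumptions, not ELEGANT's actual pipeline.

```python
# Hedged sketch of zero-shot relation labeling with CLIP: crop the union box
# of an object pair and pick the predicate whose prompt scores highest.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
PREDICATES = ["holding", "on", "next to", "wearing"]

@torch.no_grad()
def classify_relation(image, subj, obj, union_box):
    crop = image.crop(union_box)  # PIL box covering both objects
    prompts = tokenizer([f"a {subj} {p} a {obj}" for p in PREDICATES])
    t = model.encode_text(prompts)
    v = model.encode_image(preprocess(crop).unsqueeze(0))
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    return PREDICATES[(v @ t.T).argmax().item()]
```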
arXiv Detail & Related papers (2023-10-02T17:19:04Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
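A generic sketch of region-to-semantic matching follows; the max-pooling readout and the use of class word vectors are assumptions about the idea, not the paper's architecture.

```python
# Match image-region embeddings against class semantic embeddings; unseen
# classes are scored because they live in the same semantic space.
import torch.nn.functional as F

def match_regions(region_feats, class_embeds):
    """region_feats: (R, D) visual regions projected into the semantic space;
    class_embeds: (C, D) semantic vectors, including unseen classes.
    Returns per-class image scores via the best-matching region."""
    sims = F.normalize(region_feats, dim=-1) @ \
           F.normalize(class_embeds, dim=-1).T
    return sims.max(dim=0).values  # (C,)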
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Open Long-Tailed Recognition in a Dynamic World [82.91025831618545]
Real world data often exhibits a long-tailed and open-ended (with unseen classes) distribution.
A practical recognition system must balance the majority (head) and minority (tail) classes, generalize across the distribution, and acknowledge novelty in instances of unseen (open) classes.
We define Open Long-Tailed Recognition++ as learning from such naturally distributed data and optimizing for the classification accuracy over a balanced test set.
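The evaluation this definition implies can be sketched as follows; the many/medium/few split thresholds are illustrative, and the open-class novelty measure the task also calls for is omitted here.

```python
# Train on long-tailed data, then report accuracy on a balanced test set,
# broken out by how frequent each class was at training time.
from collections import Counter

def split_classes(train_labels, many=100, few=20):
    counts = Counter(train_labels)
    groups = {"many": set(), "medium": set(), "few": set()}
    for c, n in counts.items():
        groups["many" if n >= many else "few" if n < few else "medium"].add(c)
    return groups

def grouped_accuracy(y_true, y_pred, groups):
    out = {}
    for name, classes in groups.items():
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in classes]
        out[name] = sum(t == p for t, p in pairs) / max(len(pairs), 1)
    return out
```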
arXiv Detail & Related papers (2022-08-17T15:22:20Z)
- Opening up Open-World Tracking [62.12659607088812]
We propose and study Open-World Tracking (OWT).
The paper formalizes the OWT task and introduces an evaluation protocol and metric (OWTA).
We show that our Open-World Tracking Baseline, while performing well in the OWT setting, also achieves near state-of-the-art results on traditional closed-world benchmarks.
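To my understanding, OWTA combines detection recall with association accuracy as a geometric mean, using recall rather than precision because unknown objects cannot be exhaustively annotated; treat the sketch below as hedged, with the HOTA-style averaging over localization thresholds omitted.

```python
# Hedged sketch of the OWTA combination rule (assumed form).
def owta(det_recall: float, assoc_acc: float) -> float:
    assert 0.0 <= det_recall <= 1.0 and 0.0 <= assoc_acc <= 1.0
    return (det_recall * assoc_acc) ** 0.5

print(owta(0.8, 0.6))  # ~0.693
```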
arXiv Detail & Related papers (2021-04-22T17:58:15Z)
- Towards Open World Object Detection [68.79678648726416]
ORE (Open World Object Detector) is based on contrastive clustering and energy-based unknown identification.
We find that identifying and characterizing unknown instances helps to reduce confusion in an incremental object detection setting.
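Energy-based unknown identification can be sketched with the standard free-energy score over classifier logits; the fixed threshold below is an assumption, as ORE instead models the distributions of energies for known and unknown proposals.

```python
# Free-energy score over detection logits: lower energy suggests a known
# class, higher energy suggests an unknown instance.
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Helmholtz free energy of classifier logits; lower = more 'known'."""
    return -T * torch.logsumexp(logits / T, dim=-1)

def flag_unknown(logits, threshold=-4.0):  # threshold is illustrative
    return energy_score(logits) > threshold
```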
arXiv Detail & Related papers (2021-03-03T18:58:18Z)
- Knowledge Guided Learning: Towards Open Domain Egocentric Action Recognition with Zero Supervision [5.28539620288341]
We show that attention and commonsense knowledge can be used to enable the self-supervised discovery of novel actions in egocentric videos.
We show that our approach can infer and learn novel classes for open vocabulary classification in egocentric videos.
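One way to read the open-vocabulary step: map a discovered verb-noun pair to the nearest label in an unbounded vocabulary by word-embedding similarity. The helper below is hypothetical, not the paper's method.

```python
# Map a discovered 'verb noun' phrase embedding to its nearest label in an
# open vocabulary, so novel classes need no pre-enumerated label set.
import numpy as np

def nearest_label(pair_vec: np.ndarray, vocab: dict[str, np.ndarray]) -> str:
    """pair_vec: embedding of the discovered phrase (e.g., mean of its word
    vectors); vocab maps candidate labels to embeddings."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(vocab, key=lambda label: cos(pair_vec, vocab[label]))
```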
arXiv Detail & Related papers (2020-09-16T04:44:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.