Opening the Vocabulary of Egocentric Actions
- URL: http://arxiv.org/abs/2308.11488v2
- Date: Tue, 12 Dec 2023 15:10:15 GMT
- Title: Opening the Vocabulary of Egocentric Actions
- Authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao
- Abstract summary: This paper proposes a novel open vocabulary action recognition task.
Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects.
We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human actions in egocentric videos are often hand-object interactions
composed from a verb (performed by the hand) applied to an object. Despite
extensive scaling up, egocentric datasets still face two limitations: the
sparsity of action compositions and a closed set of interacting objects. This
paper proposes a novel open vocabulary action recognition task. Given a set of
verbs and objects observed during training, the goal is to generalize the verbs
to an open vocabulary of actions with seen and novel objects. To this end, we
decouple the verb and object predictions via an object-agnostic verb encoder
and a prompt-based object encoder. The prompting leverages CLIP representations
to predict an open vocabulary of interacting objects. We create open vocabulary
benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas
closed-action methods fail to generalize, our proposed method is effective. In
addition, our object encoder significantly outperforms existing open-vocabulary
visual recognition methods in recognizing novel interacting objects.
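The prompt-based object encoder described above can be sketched at its core as CLIP-style zero-shot scoring: a video embedding is compared against text embeddings of object prompts, and the softmax over cosine similarities gives an open-vocabulary distribution over object names. The sketch below uses random stand-in embeddings (no CLIP model is loaded), and all names, prompts, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of CLIP-style open-vocabulary object scoring.
# Embeddings are random stand-ins; a real system would obtain them from
# CLIP's image and text towers on prompts such as
# "a photo of a hand holding a {object}".
import numpy as np

def normalize(x):
    # L2-normalize along the last axis so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def open_vocab_scores(video_emb, text_embs, temperature=0.07):
    """Softmax over cosine similarities between one video embedding
    and a set of object-prompt text embeddings."""
    sims = normalize(text_embs) @ normalize(video_emb)
    logits = sims / temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
objects = ["knife", "spatula", "screwdriver"]  # candidate vocabulary (extendable at test time)
video_emb = rng.standard_normal(8)
text_embs = rng.standard_normal((3, 8))
# Make the "knife" prompt embedding close to the video embedding,
# standing in for a true match.
text_embs[0] = video_emb + 0.1 * rng.standard_normal(8)

probs = open_vocab_scores(video_emb, text_embs)
print(objects[int(np.argmax(probs))])  # expected to print "knife" for these toy embeddings
```

Because the vocabulary is only a list of prompt strings, novel objects can be scored at test time simply by appending new prompts, which is what makes the object side open-vocabulary.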
Related papers
- Interacted Object Grounding in Spatio-Temporal Human-Object Interactions
We introduce a new open-world benchmark: Grounding Interacted Objects (GIO).
GIO poses an object grounding task in which vision systems are expected to discover the objects a person interacts with.
We propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos.
arXiv Detail & Related papers (2024-12-27T09:08:46Z) - From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an unbounded vocabulary.
OVD relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception.
We propose a framework that enables OVD models to operate in open world settings, by identifying and incrementally learning novel objects.
arXiv Detail & Related papers (2024-11-27T10:33:51Z) - Free-Form Composition Networks for Egocentric Action Recognition
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z) - Modelling Spatio-Temporal Interactions for Compositional Action Recognition
Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed.
We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset.
Our approach of explicit human-object-stuff interaction modeling is effective even for standard action recognition datasets.
arXiv Detail & Related papers (2023-05-04T09:37:45Z) - Verbs in Action: Improving verb understanding in video-language models
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z) - Disentangled Action Recognition with Knowledge Bases
We aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns.
Previous work utilizes verb-noun compositional action nodes in the knowledge graph, making it inefficient to scale.
We propose our approach: Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions.
arXiv Detail & Related papers (2022-07-04T20:19:13Z) - Learning Using Privileged Information for Zero-Shot Action Recognition
This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap.
Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have shown that the proposed method outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-06-17T08:46:09Z) - COBE: Contextualized Object Embeddings from Narrated Instructional Video
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.