Opening the Vocabulary of Egocentric Actions
- URL: http://arxiv.org/abs/2308.11488v2
- Date: Tue, 12 Dec 2023 15:10:15 GMT
- Title: Opening the Vocabulary of Egocentric Actions
- Authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao
- Abstract summary: This paper proposes a novel open vocabulary action recognition task.
Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects.
We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets.
- Score: 42.94865322371628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human actions in egocentric videos are often hand-object interactions composed of a verb (performed by the hand) applied to an object. Despite extensive scaling up, egocentric datasets still face two limitations: sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.
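As a rough illustration of the decoupled design described in the abstract, the sketch below pairs a generic closed-set verb classifier with CLIP zero-shot prompt matching over object names. The prompt template, object list, single-frame input, and stand-in video features are illustrative assumptions, not the paper's actual encoders.

```python
# Minimal sketch of decoupled verb/object prediction, assuming CLIP prompt
# matching for the object branch. Not the paper's implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# --- Object branch: prompt-based open-vocabulary scoring with CLIP ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Open vocabulary: candidate objects need not have been seen in training.
objects = ["spatula", "kettle", "screwdriver", "toy cabin"]  # hypothetical list
prompts = [f"a photo of a hand interacting with a {o}" for o in objects]

frame = Image.open("frame.jpg")  # a sampled egocentric video frame (hypothetical path)
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    object_probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]

# --- Verb branch: object-agnostic classifier over the closed verb set ---
verbs = ["take", "pour", "cut", "screw"]       # verbs observed during training
video_feats = torch.randn(1, 512)              # stand-in for pooled video features
verb_head = torch.nn.Linear(512, len(verbs))
verb_probs = verb_head(video_feats).softmax(dim=-1)[0]

# Compose the action from the two independent predictions.
verb = verbs[verb_probs.argmax().item()]
obj = objects[object_probs.argmax().item()]
print(f"predicted action: {verb} {obj}")
```

Because the object branch scores arbitrary text prompts, the candidate object list can be swapped at inference time without retraining, which is what makes that branch open-vocabulary.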
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework.
It simultaneously performs Detection And Interaction Reasoning in one stage.
We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
- Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z)
- Modelling Spatio-Temporal Interactions for Compositional Action Recognition [21.8767024220287]
Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed.
We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset.
Our approach of explicit human-object-stuff interaction modeling is effective even for standard action recognition datasets.
arXiv Detail & Related papers (2023-05-04T09:37:45Z)
- Verbs in Action: Improving verb understanding in video-language models [128.87443209118726]
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z)
- Object-agnostic Affordance Categorization via Unsupervised Learning of Graph Embeddings [6.371828910727037]
Acquiring knowledge about object interactions and affordances can facilitate scene understanding and human-robot collaboration tasks.
We address the problem of affordance categorization for class-agnostic objects with an open set of interactions.
A novel depth-informed qualitative spatial representation is proposed for the construction of Activity Graphs.
arXiv Detail & Related papers (2023-03-30T15:04:04Z)
- Disentangled Action Recognition with Knowledge Bases [77.77482846456478]
We aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns.
Previous work uses verb-noun compositional action nodes in the knowledge graph, which scales poorly.
We propose our approach: Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions.
arXiv Detail & Related papers (2022-07-04T20:19:13Z)
- Learning Using Privileged Information for Zero-Shot Action Recognition [15.9032110752123]
This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap.
Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have shown that the proposed method outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-06-17T08:46:09Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
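As a rough sketch of the idea in the COBE entry above: a detector is trained so that the visual feature of a detected object matches the contextualized embedding of the object word inside its narration. Everything below is an illustrative assumption for demonstration (an off-the-shelf BERT as the text model, a random stand-in for the detector's region feature, a cosine loss); COBE's actual detector and language model may differ.

```python
# Illustrative sketch: align a region feature with the contextualized
# embedding of the object word in its narration. Not COBE's exact setup.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_model = AutoModel.from_pretrained("bert-base-uncased")

narration = "now I pour the batter into the hot pan"  # hypothetical narration
object_word = "pan"

# Contextualized target: the hidden state of the object token in context.
enc = tokenizer(narration, return_tensors="pt")
with torch.no_grad():
    hidden = text_model(**enc).last_hidden_state[0]           # (seq_len, 768)
token_ids = enc["input_ids"][0].tolist()
obj_pos = token_ids.index(tokenizer.convert_tokens_to_ids(object_word))
target = hidden[obj_pos]                                      # (768,)

# Hypothetical detector output for the object's region, projected to text space.
region_feat = torch.randn(1, 1024)                            # stand-in for a real detector
project = torch.nn.Linear(1024, 768)
pred = project(region_feat)[0]

# Train the detector head to predict the contextualized word embedding.
loss = 1.0 - F.cosine_similarity(pred, target, dim=0)
loss.backward()
```

Because the target embedding depends on the whole narration, the same object word gets different targets in different contexts, which is what makes the learned object embeddings contextualized.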