Opening the Vocabulary of Egocentric Actions
- URL: http://arxiv.org/abs/2308.11488v2
- Date: Tue, 12 Dec 2023 15:10:15 GMT
- Title: Opening the Vocabulary of Egocentric Actions
- Authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao
- Abstract summary: This paper proposes a novel open vocabulary action recognition task.
Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects.
We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human actions in egocentric videos are often hand-object interactions
composed from a verb (performed by the hand) applied to an object. Despite
extensive scaling up, egocentric datasets still face two limitations: the
sparsity of action compositions and a closed set of interacting objects. This
paper proposes a novel open vocabulary action recognition task. Given a set of
verbs and objects observed during training, the goal is to generalize the verbs
to an open vocabulary of actions with seen and novel objects. To this end, we
decouple the verb and object predictions via an object-agnostic verb encoder
and a prompt-based object encoder. The prompting leverages CLIP representations
to predict an open vocabulary of interacting objects. We create open vocabulary
benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas
closed-action methods fail to generalize, our proposed method is effective. In
addition, our object encoder significantly outperforms existing open-vocabulary
visual recognition methods in recognizing novel interacting objects.
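The prompt-based object encoder described above can be sketched at its core as CLIP-style zero-shot scoring: a video embedding is compared against text embeddings of object prompts, and the softmax over cosine similarities gives an open-vocabulary distribution over object names. The sketch below uses random stand-in embeddings (no CLIP model is loaded), and all names, prompts, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of CLIP-style open-vocabulary object scoring.
# Embeddings are random stand-ins; a real system would obtain them from
# CLIP's image and text towers on prompts such as
# "a photo of a hand holding a {object}".
import numpy as np

def normalize(x):
    # L2-normalize along the last axis so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def open_vocab_scores(video_emb, text_embs, temperature=0.07):
    """Softmax over cosine similarities between one video embedding
    and a set of object-prompt text embeddings."""
    sims = normalize(text_embs) @ normalize(video_emb)
    logits = sims / temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
objects = ["knife", "spatula", "screwdriver"]  # candidate vocabulary (extendable at test time)
video_emb = rng.standard_normal(8)
text_embs = rng.standard_normal((3, 8))
# Make the "knife" prompt embedding close to the video embedding,
# standing in for a true match.
text_embs[0] = video_emb + 0.1 * rng.standard_normal(8)

probs = open_vocab_scores(video_emb, text_embs)
print(objects[int(np.argmax(probs))])  # expected to print "knife" for these toy embeddings
```

Because the vocabulary is only a list of prompt strings, novel objects can be scored at test time simply by appending new prompts, which is what makes the object side open-vocabulary.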
Related papers
- Interacted Object Grounding in Spatio-Temporal Human-Object Interactions
We introduce a new open-world benchmark: Grounding Interacted Objects (GIO).
GIO poses an object grounding task in which vision systems are expected to discover the objects a person interacts with.
We propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos.
arXiv Detail & Related papers (2024-12-27T09:08:46Z) - From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an unbounded vocabulary.
OVD relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception.
We propose a framework that enables OVD models to operate in open world settings, by identifying and incrementally learning novel objects.
arXiv Detail & Related papers (2024-11-27T10:33:51Z) - Free-Form Composition Networks for Egocentric Action Recognition
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z) - Modelling Spatio-Temporal Interactions for Compositional Action Recognition
Humans have the natural ability to recognize actions even if the objects involved in the action or the background are changed.
We show the effectiveness of our interaction-centric approach on the compositional Something-Else dataset.
Our approach of explicit human-object-stuff interaction modeling is effective even for standard action recognition datasets.
arXiv Detail & Related papers (2023-05-04T09:37:45Z) - Verbs in Action: Improving verb understanding in video-language models
State-of-the-art video-language models based on CLIP have been shown to have limited verb understanding.
We improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive framework.
arXiv Detail & Related papers (2023-04-13T17:57:01Z) - Disentangled Action Recognition with Knowledge Bases
We aim to improve the generalization ability of the compositional action recognition model to novel verbs or novel nouns.
Previous work utilizes verb-noun compositional action nodes in the knowledge graph, making it inefficient to scale.
We propose our approach: Disentangled Action Recognition with Knowledge-bases (DARK), which leverages the inherent compositionality of actions.
arXiv Detail & Related papers (2022-07-04T20:19:13Z) - Learning Using Privileged Information for Zero-Shot Action Recognition
This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap.
Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have shown that the proposed method outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-06-17T08:46:09Z) - COBE: Contextualized Object Embeddings from Narrated Instructional Video
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.