Learning to Recognize Actions on Objects in Egocentric Video with
Attention Dictionaries
- URL: http://arxiv.org/abs/2102.08065v1
- Date: Tue, 16 Feb 2021 10:26:04 GMT
- Title: Learning to Recognize Actions on Objects in Egocentric Video with
Attention Dictionaries
- Authors: Swathikiran Sudhakaran and Sergio Escalera and Oswald Lanz
- Abstract summary: We present EgoACO, a deep neural architecture for video action recognition.
It learns to pool action-context-object descriptors from frame level features.
CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions.
- Score: 51.48859591280838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present EgoACO, a deep neural architecture for video action recognition
that learns to pool action-context-object descriptors from frame level features
by leveraging the verb-noun structure of action labels in egocentric video
datasets. The core component of EgoACO is class activation pooling (CAP), a
differentiable pooling operation that combines ideas from bilinear pooling for
fine-grained recognition and from feature learning for discriminative
localization. CAP uses self-attention with a dictionary of learnable weights to
pool from the most relevant feature regions. Through CAP, EgoACO learns to
decode object and scene context descriptors from video frame features. For
temporal modeling in EgoACO, we design a recurrent version of class activation
pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional
gated LSTM with built-in spatial attention and a re-designed output gate.
Action, object and context descriptors are fused by a multi-head prediction
that accounts for the inter-dependencies between noun-verb-action structured
labels in egocentric video datasets. EgoACO features built-in visual
explanations, helping learning and interpretation. Results on the two largest
egocentric action recognition datasets currently available, EPIC-KITCHENS and
EGTEA, show that by explicitly decoding action-context-object descriptors,
EgoACO achieves state-of-the-art recognition performance.
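To make the pooling idea concrete, below is a minimal PyTorch-style sketch of dictionary-based attention pooling in the spirit of CAP: a set of learnable dictionary entries scores every spatial location of the frame features, and descriptors are pooled from the most relevant regions. The module name, tensor shapes, softmax temperature and the final average over dictionary entries are illustrative assumptions, not the authors' exact formulation.

# Minimal sketch of dictionary-based attention pooling in the spirit of CAP.
# Shapes, the temperature and the output reduction are assumptions, not the
# paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DictAttentionPool(nn.Module):
    def __init__(self, in_channels: int, dict_size: int):
        super().__init__()
        # Dictionary of learnable weights; each entry attends to feature regions.
        self.dictionary = nn.Parameter(torch.randn(dict_size, in_channels) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) frame-level features from a CNN backbone.
        b, c, h, w = feats.shape
        flat = feats.flatten(2)                      # (B, C, H*W)
        # Attention scores between each dictionary entry and each spatial location.
        scores = torch.einsum('kc,bcn->bkn', self.dictionary, flat)
        attn = F.softmax(scores / c ** 0.5, dim=-1)  # normalize over spatial positions
        # Pool features from the most relevant regions for each dictionary entry.
        pooled = torch.einsum('bkn,bcn->bkc', attn, flat)
        # Collapse dictionary entries into a single descriptor (one simple choice).
        return pooled.mean(dim=1)                    # (B, C)

# Usage: pool a descriptor from a batch of 2 frames with 512-channel features.
pool = DictAttentionPool(in_channels=512, dict_size=16)
descriptor = pool(torch.randn(2, 512, 14, 14))       # -> (2, 512)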
Related papers
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our work can be seamlessly integrated into existing models with minimal overhead and brings consistent performance improvements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Action Scene Graphs for Long-Form Understanding of Egocentric Videos [23.058999979457546]
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos.
EASGs provide a temporally evolving graph-based description of the actions performed by the camera wearer.
We will release the dataset and the code to replicate experiments and annotations.
arXiv Detail & Related papers (2023-12-06T10:01:43Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z)
- Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation [124.07372905781696]
Actional Atomic-Concept Learning (AACL) maps visual observations to actional atomic concepts to facilitate the alignment between observations and language instructions.
AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks.
arXiv Detail & Related papers (2023-02-13T03:08:05Z)
- Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms [18.857745441710076]
Self-Attention has become prevalent in computer vision models.
We propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating it with convolutions.
The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation.
arXiv Detail & Related papers (2021-07-12T18:00:00Z)
- Egocentric Action Recognition by Video Attention and Temporal Context [83.57475598382146]
We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single 'verb' and 'noun' class label given an input trimmed video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
arXiv Detail & Related papers (2020-07-03T18:00:32Z)