Egocentric Action Recognition by Video Attention and Temporal Context
- URL: http://arxiv.org/abs/2007.01883v1
- Date: Fri, 3 Jul 2020 18:00:32 GMT
- Title: Egocentric Action Recognition by Video Attention and Temporal Context
- Authors: Juan-Manuel Perez-Rua, Antoine Toisoul, Brais Martinez, Victor
Escorcia, Li Zhang, Xiatian Zhu, Tao Xiang
- Abstract summary: We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single `verb' and `noun' class label given an input trimmed video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
- Score: 83.57475598382146
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the submission of Samsung AI Centre Cambridge to the CVPR 2020
EPIC-Kitchens Action Recognition Challenge. In this challenge, action
recognition is posed as the problem of simultaneously predicting a single
`verb' and `noun' class label given an input trimmed video clip. That is, a
`verb' and a `noun' together define a compositional `action' class. The
challenging aspects of this real-life action recognition task include small
fast moving objects, complex hand-object interactions, and occlusions. At the
core of our submission is a recently proposed spatial-temporal video attention
model, called `W3' (`What-Where-When') attention (Perez-Rua et al., 2020). We
further introduce a simple yet effective contextual learning mechanism to model
`action' class scores directly from long-term temporal behaviour based on the
`verb' and `noun' prediction scores. Our solution achieves strong performance
on the challenge metrics without using object-specific reasoning or extra
training data. In particular, our best solution with multimodal ensemble
achieves the 2nd best position for `verb', and 3rd best for `noun'
and `action' on the Seen Kitchens test set.
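The contextual learning mechanism itself is not detailed in the abstract, but the compositional structure it builds on, an `action' being a (`verb', `noun') pair, can be illustrated with a minimal sketch. The snippet below simply scores each annotated (verb, noun) pair as the product of the verb and noun probabilities; the function name, tensor shapes, and pair list are illustrative assumptions, not the authors' implementation, which additionally learns from long-term temporal behaviour.

```python
# Hypothetical sketch: compose `action' scores from separate `verb' and `noun'
# scores. This is only the compositional baseline implied by the problem setup,
# not the paper's learned contextual mechanism.
import torch

def compose_action_scores(verb_logits, noun_logits, valid_pairs):
    """verb_logits: (B, V), noun_logits: (B, N), valid_pairs: list of (verb_id, noun_id).
    Returns action scores of shape (B, len(valid_pairs))."""
    verb_prob = verb_logits.softmax(dim=-1)                   # (B, V)
    noun_prob = noun_logits.softmax(dim=-1)                   # (B, N)
    joint = verb_prob.unsqueeze(2) * noun_prob.unsqueeze(1)   # (B, V, N), all combinations
    v_idx = torch.tensor([v for v, _ in valid_pairs])
    n_idx = torch.tensor([n for _, n in valid_pairs])
    return joint[:, v_idx, n_idx]                             # keep only annotated pairs

# Toy example: 3 verbs, 4 nouns, 5 annotated (verb, noun) action classes.
pairs = [(0, 1), (0, 2), (1, 0), (2, 3), (2, 1)]
action_scores = compose_action_scores(torch.randn(2, 3), torch.randn(2, 4), pairs)
print(action_scores.shape)  # torch.Size([2, 5])
```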
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attention maps obtained from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
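The paper's similarity measure is a directed Gromov-Wasserstein discrepancy, which is not reproduced here; as a simpler stand-in, the sketch below uses cosine similarity between flattened attention maps to show what a cross-view attention consistency loss can look like. All shapes and names are assumptions.

```python
# Hypothetical sketch of a cross-view attention consistency loss, using plain
# cosine similarity in place of the paper's directed Gromov-Wasserstein discrepancy.
import torch
import torch.nn.functional as F

def attention_consistency_loss(attn_view_a, attn_view_b):
    """attn_view_*: (B, T, H, W) attention maps for the same clips seen from two views.
    Returns 1 - mean cosine similarity, so identical maps give zero loss."""
    a = attn_view_a.flatten(start_dim=1)
    b = attn_view_b.flatten(start_dim=1)
    return 1.0 - F.cosine_similarity(a, b, dim=1).mean()

loss = attention_consistency_loss(torch.rand(8, 16, 7, 7), torch.rand(8, 16, 7, 7))
```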
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates the features from all the clips in an online fashion for the final class prediction.
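A minimal sketch of the encode-then-aggregate data flow described above, assuming precomputed per-clip backbone features and a GRU as the online aggregator; the paper's prototype learning is not modelled and all dimensions are illustrative.

```python
# Hypothetical sketch: clips are encoded independently, then a recurrent decoder
# aggregates them online so a prediction is available after every observed clip.
import torch
import torch.nn as nn

class OnlineClipAggregator(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU())  # stand-in for a video backbone
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (B, num_clips, 2048), one feature vector per clip.
        encoded = self.encoder(clip_feats)   # (B, num_clips, feat_dim)
        states, _ = self.decoder(encoded)    # (B, num_clips, hidden)
        return self.classifier(states)       # logits after each clip (early prediction)

logits = OnlineClipAggregator()(torch.randn(4, 10, 2048))  # (4, 10, 100)
```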
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, thereby significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z) - Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
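As a rough illustration of the temporal half of such a jigsaw pretext task (the paper decouples spatial and temporal puzzles), the sketch below shuffles a short stack of frame features with a known permutation and uses the permutation index as a free, self-supervised label. All names and sizes are assumptions, not the paper's setup.

```python
# Hypothetical sketch of a temporal jigsaw pretext task: shuffle frames with a
# known permutation; a classifier would then be trained to predict which one.
import itertools
import random
import torch

NUM_FRAMES = 4
PERMUTATIONS = list(itertools.permutations(range(NUM_FRAMES)))  # 4! = 24 pretext classes

def make_jigsaw_sample(frame_feats):
    """frame_feats: (NUM_FRAMES, D) temporally ordered features.
    Returns (shuffled features, permutation label)."""
    label = random.randrange(len(PERMUTATIONS))
    order = torch.tensor(PERMUTATIONS[label])
    return frame_feats[order], label

shuffled, perm_label = make_jigsaw_sample(torch.randn(NUM_FRAMES, 128))
```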
arXiv Detail & Related papers (2022-07-20T19:49:32Z) - Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Br-Prompt achieves state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-03-26T15:52:27Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips [39.29955809641396]
Many background or noisy frames in a first-person video can distract an action recognition model during its learning process.
Previous works attempted to address this problem by applying temporal attention, but failed to consider the global context of the full video.
We propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across clips.
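A minimal sketch in the spirit of that description, not the STAM architecture itself: the attention query is a summary of all clip features, so each clip's weight depends on global, video-level context rather than on the clip alone. Dimensions and names are assumptions.

```python
# Hypothetical sketch of temporal attention over clips driven by a global query.
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, clip_feats):
        # clip_feats: (B, num_clips, dim)
        global_query = self.query_proj(clip_feats.mean(dim=1, keepdim=True))  # (B, 1, dim)
        keys = self.key_proj(clip_feats)                                      # (B, num_clips, dim)
        scores = global_query @ keys.transpose(1, 2) / keys.size(-1) ** 0.5   # (B, 1, num_clips)
        weights = scores.softmax(dim=-1)                                      # emphasis per clip
        return (weights @ clip_feats).squeeze(1)                              # (B, dim) video feature

video_feat = TemporalAttentionPool()(torch.randn(2, 8, 512))  # (2, 512)
```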
arXiv Detail & Related papers (2021-12-02T08:02:35Z) - A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [31.218804432716702]
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structural design.
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
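A minimal sketch of the attend-both-ways pattern implied above, assuming video clip features and query word features already projected to a shared dimension; it does not reproduce the CMA module's exact structure or the task-specific regression loss.

```python
# Hypothetical sketch of two-branch cross-modality attention: video attends to the
# sentence and the sentence attends to the video; fusion details are illustrative.
import torch
import torch.nn as nn

class TwoBranchCrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vid_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_vid = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, text):
        # video: (B, T, dim) clip features; text: (B, L, dim) word features.
        video_ctx, _ = self.vid_from_txt(query=video, key=text, value=text)
        text_ctx, _ = self.txt_from_vid(query=text, key=video, value=video)
        return video_ctx, text_ctx  # language-aware video features and vice versa

v_ctx, t_ctx = TwoBranchCrossModalAttention()(torch.randn(2, 64, 256), torch.randn(2, 12, 256))
```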
arXiv Detail & Related papers (2020-09-23T16:03:00Z) - FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.