Egocentric Action Recognition by Video Attention and Temporal Context
- URL: http://arxiv.org/abs/2007.01883v1
- Date: Fri, 3 Jul 2020 18:00:32 GMT
- Title: Egocentric Action Recognition by Video Attention and Temporal Context
- Authors: Juan-Manuel Perez-Rua, Antoine Toisoul, Brais Martinez, Victor
Escorcia, Li Zhang, Xiatian Zhu, Tao Xiang
- Abstract summary: We present the submission of Samsung AI Centre Cambridge to the CVPR 2020 EPIC-Kitchens Action Recognition Challenge.
In this challenge, action recognition is posed as the problem of simultaneously predicting a single `verb' and `noun' class label given an input trimmed video clip.
Our solution achieves strong performance on the challenge metrics without using object-specific reasoning or extra training data.
- Score: 83.57475598382146
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present the submission of Samsung AI Centre Cambridge to the CVPR 2020
EPIC-Kitchens Action Recognition Challenge. In this challenge, action
recognition is posed as the problem of simultaneously predicting a single
`verb' and `noun' class label given an input trimmed video clip. That is, a
`verb' and a `noun' together define a compositional `action' class. The
challenging aspects of this real-life action recognition task include small
fast moving objects, complex hand-object interactions, and occlusions. At the
core of our submission is a recently proposed spatial-temporal video attention
model, called `W3' (`What-Where-When') attention (Perez-Rua et al., 2020). We
further introduce a simple yet effective contextual learning mechanism to model
`action' class scores directly from long-term temporal behaviour based on the
`verb' and `noun' prediction scores. Our solution achieves strong performance
on the challenge metrics without using object-specific reasoning or extra
training data. In particular, our best solution with multimodal ensemble
achieves the 2nd best position for `verb', and 3rd best for `noun'
and `action' on the Seen Kitchens test set.
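The contextual learning mechanism itself is not detailed in the abstract, but the compositional structure it builds on, an `action' being a (`verb', `noun') pair, can be illustrated with a minimal sketch. The snippet below simply scores each annotated (verb, noun) pair as the product of the verb and noun probabilities; the function name, tensor shapes, and pair list are illustrative assumptions, not the authors' implementation, which additionally learns from long-term temporal behaviour.

```python
# Hypothetical sketch: compose `action' scores from separate `verb' and `noun'
# scores. This is only the compositional baseline implied by the problem setup,
# not the paper's learned contextual mechanism.
import torch

def compose_action_scores(verb_logits, noun_logits, valid_pairs):
    """verb_logits: (B, V), noun_logits: (B, N), valid_pairs: list of (verb_id, noun_id).
    Returns action scores of shape (B, len(valid_pairs))."""
    verb_prob = verb_logits.softmax(dim=-1)                   # (B, V)
    noun_prob = noun_logits.softmax(dim=-1)                   # (B, N)
    joint = verb_prob.unsqueeze(2) * noun_prob.unsqueeze(1)   # (B, V, N), all combinations
    v_idx = torch.tensor([v for v, _ in valid_pairs])
    n_idx = torch.tensor([n for _, n in valid_pairs])
    return joint[:, v_idx, n_idx]                             # keep only annotated pairs

# Toy example: 3 verbs, 4 nouns, 5 annotated (verb, noun) action classes.
pairs = [(0, 1), (0, 2), (1, 0), (2, 3), (2, 1)]
action_scores = compose_action_scores(torch.randn(2, 3), torch.randn(2, 4), pairs)
print(action_scores.shape)  # torch.Size([2, 5])
```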
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attention maps obtained from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
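The paper's similarity measure is a directed Gromov-Wasserstein discrepancy, which is not reproduced here; as a simpler stand-in, the sketch below uses cosine similarity between flattened attention maps to show what a cross-view attention consistency loss can look like. All shapes and names are assumptions.

```python
# Hypothetical sketch of a cross-view attention consistency loss, using plain
# cosine similarity in place of the paper's directed Gromov-Wasserstein discrepancy.
import torch
import torch.nn.functional as F

def attention_consistency_loss(attn_view_a, attn_view_b):
    """attn_view_*: (B, T, H, W) attention maps for the same clips seen from two views.
    Returns 1 - mean cosine similarity, so identical maps give zero loss."""
    a = attn_view_a.flatten(start_dim=1)
    b = attn_view_b.flatten(start_dim=1)
    return 1.0 - F.cosine_similarity(a, b, dim=1).mean()

loss = attention_consistency_loss(torch.rand(8, 16, 7, 7), torch.rand(8, 16, 7, 7))
```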
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates the features from all the clips in an online fashion for the final class prediction.
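A minimal sketch of the encode-then-aggregate data flow described above, assuming precomputed per-clip backbone features and a GRU as the online aggregator; the paper's prototype learning is not modelled and all dimensions are illustrative.

```python
# Hypothetical sketch: clips are encoded independently, then a recurrent decoder
# aggregates them online so a prediction is available after every observed clip.
import torch
import torch.nn as nn

class OnlineClipAggregator(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU())  # stand-in for a video backbone
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (B, num_clips, 2048), one feature vector per clip.
        encoded = self.encoder(clip_feats)   # (B, num_clips, feat_dim)
        states, _ = self.decoder(encoded)    # (B, num_clips, hidden)
        return self.classifier(states)       # logits after each clip (early prediction)

logits = OnlineClipAggregator()(torch.randn(4, 10, 2048))  # (4, 10, 100)
```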
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes, thereby significantly improving action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z) - Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
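As a rough illustration of the temporal half of such a jigsaw pretext task (the paper decouples spatial and temporal puzzles), the sketch below shuffles a short stack of frame features with a known permutation and uses the permutation index as a free, self-supervised label. All names and sizes are assumptions, not the paper's setup.

```python
# Hypothetical sketch of a temporal jigsaw pretext task: shuffle frames with a
# known permutation; a classifier would then be trained to predict which one.
import itertools
import random
import torch

NUM_FRAMES = 4
PERMUTATIONS = list(itertools.permutations(range(NUM_FRAMES)))  # 4! = 24 pretext classes

def make_jigsaw_sample(frame_feats):
    """frame_feats: (NUM_FRAMES, D) temporally ordered features.
    Returns (shuffled features, permutation label)."""
    label = random.randrange(len(PERMUTATIONS))
    order = torch.tensor(PERMUTATIONS[label])
    return frame_feats[order], label

shuffled, perm_label = make_jigsaw_sample(torch.randn(NUM_FRAMES, 128))
```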
arXiv Detail & Related papers (2022-07-20T19:49:32Z) - Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Br-Prompt achieves state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-03-26T15:52:27Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips [39.29955809641396]
Many background or noisy frames in a first-person video can distract an action recognition model during its learning process.
Previous works attempted to address this problem by applying temporal attention, but failed to consider the global context of the full video.
We propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across clips.
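A minimal sketch in the spirit of that description, not the STAM architecture itself: the attention query is a summary of all clip features, so each clip's weight depends on global, video-level context rather than on the clip alone. Dimensions and names are assumptions.

```python
# Hypothetical sketch of temporal attention over clips driven by a global query.
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, clip_feats):
        # clip_feats: (B, num_clips, dim)
        global_query = self.query_proj(clip_feats.mean(dim=1, keepdim=True))  # (B, 1, dim)
        keys = self.key_proj(clip_feats)                                      # (B, num_clips, dim)
        scores = global_query @ keys.transpose(1, 2) / keys.size(-1) ** 0.5   # (B, 1, num_clips)
        weights = scores.softmax(dim=-1)                                      # emphasis per clip
        return (weights @ clip_feats).squeeze(1)                              # (B, dim) video feature

video_feat = TemporalAttentionPool()(torch.randn(2, 8, 512))  # (2, 512)
```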
arXiv Detail & Related papers (2021-12-02T08:02:35Z) - A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [31.218804432716702]
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structural design.
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
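A minimal sketch of the attend-both-ways pattern implied above, assuming video clip features and query word features already projected to a shared dimension; it does not reproduce the CMA module's exact structure or the task-specific regression loss.

```python
# Hypothetical sketch of two-branch cross-modality attention: video attends to the
# sentence and the sentence attends to the video; fusion details are illustrative.
import torch
import torch.nn as nn

class TwoBranchCrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vid_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_vid = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, text):
        # video: (B, T, dim) clip features; text: (B, L, dim) word features.
        video_ctx, _ = self.vid_from_txt(query=video, key=text, value=text)
        text_ctx, _ = self.txt_from_vid(query=text, key=video, value=video)
        return video_ctx, text_ctx  # language-aware video features and vice versa

v_ctx, t_ctx = TwoBranchCrossModalAttention()(torch.randn(2, 64, 256), torch.randn(2, 12, 256))
```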
arXiv Detail & Related papers (2020-09-23T16:03:00Z) - FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.