SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action
  Recognition Challenge 2021
- URL: http://arxiv.org/abs/2110.02902v1
- Date: Wed, 6 Oct 2021 16:29:47 GMT
- Title: SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action
  Recognition Challenge 2021
- Authors: Swathikiran Sudhakaran and Adrian Bulat and Juan-Manuel Perez-Rua and
  Alex Falcon and Sergio Escalera and Oswald Lanz and Brais Martinez and
  Georgios Tzimiropoulos
- Abstract summary: This report presents the technical details of our submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021.
Our submission, visible on the public leaderboard, achieved a top-1 action recognition accuracy of 44.82%, using only RGB.
- Score: 80.05652375838073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   This report presents the technical details of our submission to the
EPIC-Kitchens-100 Action Recognition Challenge 2021. To participate in the
challenge we deployed spatio-temporal feature extraction and aggregation models
we have developed recently: GSF and XViT. GSF is an efficient spatio-temporal
feature extracting module that can be plugged into 2D CNNs for video action
recognition. XViT is a convolution-free video feature extractor based on the
transformer architecture. We design an ensemble of GSF and XViT model families
with different backbones and pretraining to generate the prediction scores. Our
submission, visible on the public leaderboard, achieved a top-1 action
recognition accuracy of 44.82%, using only RGB.
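
The abstract says that prediction scores come from an ensemble of GSF and XViT models, but it does not say how the scores are combined. The following is a minimal sketch of one common option, a weighted average of per-model softmax scores. It is an assumption for illustration, not the authors' confirmed method. The function names (`softmax`, `ensemble_scores`, `top1`) and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_scores(per_model_logits, weights=None):
    """Fuse class scores from several models by weighted averaging.

    per_model_logits: one list of class logits per model, all the same length.
    weights: optional per-model weights; defaults to a uniform average.
    Assumption: averaging softmax scores is only one possible fusion rule.
    """
    n = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n] * n
    num_classes = len(per_model_logits[0])
    fused = [0.0] * num_classes
    for w, logits in zip(weights, per_model_logits):
        for c, p in enumerate(softmax(logits)):
            fused[c] += w * p
    return fused


def top1(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with three classes and two hypothetical models.
gsf_logits = [2.0, 1.0, 0.1]   # stand-in for a GSF model's output
xvit_logits = [0.5, 2.5, 0.3]  # stand-in for an XViT model's output
fused = ensemble_scores([gsf_logits, xvit_logits])
prediction = top1(fused)
```

With uniform weights the fused scores still sum to one, so they can be read as class probabilities. Non-uniform weights could favor a stronger model family, though the source gives no such weights.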
 
      
Related papers
- Finger in Camera Speaks Everything: Unconstrained Air-Writing for Real-World [45.972735599458446]
 We present the groundbreaking air-writing Chinese character video dataset (AWCV-100K-UCAS2024).
This dataset captures handwritten trajectories in various real-world scenarios using commonly accessible RGB cameras.
We also introduce our baseline approach, the video-based character recognizer (VCRec).
 arXiv  Detail & Related papers  (2024-12-27T09:04:04Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
 We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates features from all the clips in an online fashion for the final class prediction.
 arXiv  Detail & Related papers  (2023-12-11T18:31:13Z)
- MITFAS: Mutual Information based Temporal Feature Alignment and Sampling
  for Aerial Video Action Recognition [59.905048445296906]
 We present a novel approach for action recognition in UAV videos.
We use the concept of mutual information to compute and align the regions corresponding to human action or motion in the temporal domain.
In practice, we achieve an 18.9% improvement in Top-1 accuracy over current state-of-the-art methods.
 arXiv  Detail & Related papers  (2023-03-05T04:05:17Z)
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
  Video Transformer Pre-training [76.69480467101143]
 Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
 arXiv  Detail & Related papers  (2022-10-11T08:05:18Z)
- NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation
  Challenge 2022 [13.603712913129506]
 We describe the technical details of our submission for the EPIC-Kitchens-100 action anticipation challenge.
Our models, the higher-order recurrent space-time transformer and the message-passing neural network with edge learning, are both recurrent architectures that observe only 2.5 seconds of inference context to form the action anticipation prediction.
By averaging the prediction scores from a set of models trained with our proposed pipeline, we achieved 19.61% overall mean top-5 recall on the test set, recorded as second place on the public leaderboard.
 arXiv  Detail & Related papers  (2022-06-22T06:34:58Z)
- Anticipative Video Transformer [105.20878510342551]
 Anticipative Video Transformer (AVT) is an end-to-end attention-based video modeling architecture.
We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features.
 arXiv  Detail & Related papers  (2021-06-03T17:57:55Z)
- FBK-HUPBA Submission to the EPIC-Kitchens Action Recognition 2020
  Challenge [43.8525418821458]
 We describe the technical details of our submission to the EPIC-Kitchens Action Recognition 2020 Challenge.
Our submission achieved a top-1 action recognition accuracy of 40.0% on the S1 setting, and 21% on the S2 setting, using only RGB.
 arXiv  Detail & Related papers  (2020-06-24T13:41:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     