EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action
Recognition 2022: Team HNU-FPV Technical Report
- URL: http://arxiv.org/abs/2207.03095v1
- Date: Thu, 7 Jul 2022 05:27:32 GMT
- Title: EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action
Recognition 2022: Team HNU-FPV Technical Report
- Authors: Nie Lin, Minjie Cai
- Abstract summary: We present our submission to the 2022 EPIC-Kitchens Unsupervised Domain Adaptation Challenge.
Our method ranks 4th among this year's teams on the test set of EPIC-KITCHENS-100.
- Score: 4.88605334919407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present the technical details of our submission to the
2022 EPIC-Kitchens Unsupervised Domain Adaptation (UDA) Challenge. Existing UDA
methods align the global features extracted from the whole video clips across
the source and target domains but suffer from the spatial redundancy of feature
matching in video recognition. Motivated by the observation that in most cases
a small image region in each video frame can be informative enough for the
action recognition task, we propose to exploit informative image regions to
perform efficient domain alignment. Specifically, we first use lightweight CNNs
to extract the global information of the input two-stream video frames and
select the informative image patches by a differentiable interpolation-based
selection strategy. Then the global information from video frames and the local
information from image patches are processed by an existing video adaptation
method, i.e., TA3N, to perform feature alignment between the source and target
domains. Our method (without model ensemble) ranks 4th among this
year's teams on the test set of EPIC-KITCHENS-100.
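The report ships no code, but the selection step is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of differentiable interpolation-based patch selection: a lightweight CNN summarizes each frame, a small head regresses a patch centre, and the crop is taken with bilinear interpolation via F.grid_sample so gradients flow back into the selector. The backbone, layer sizes, and patch size are assumptions for illustration, not the authors' implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelector(nn.Module):
    """Hypothetical sketch of differentiable interpolation-based patch
    selection: a lightweight CNN regresses a patch centre per frame, and
    the patch is cropped with bilinear interpolation (F.grid_sample), so
    gradients flow back into the selector."""

    def __init__(self, patch_size=64):
        super().__init__()
        self.patch_size = patch_size
        # Lightweight global-feature extractor (stand-in for the paper's CNNs).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Regress a patch centre (x, y) in [-1, 1] normalized coordinates.
        self.locator = nn.Linear(32, 2)

    def forward(self, frames):
        # frames: (B, 3, H, W) frames from one stream (RGB or optical flow).
        b, _, h, w = frames.shape
        feat = self.backbone(frames).flatten(1)   # (B, 32) global information
        centre = torch.tanh(self.locator(feat))   # (B, 2) patch centres
        # Local grid of patch_size x patch_size sampling points around zero.
        lin = torch.linspace(-1, 1, self.patch_size, device=frames.device)
        gy, gx = torch.meshgrid(lin, lin, indexing="ij")
        base = torch.stack([gx, gy], dim=-1)      # (ps, ps, 2), (x, y) order
        # Scale the grid to the patch extent in normalized coords, then shift.
        scale = frames.new_tensor([self.patch_size / w, self.patch_size / h])
        grid = base * scale + centre.view(b, 1, 1, 2)
        # Differentiable crop: bilinear interpolation at the grid locations.
        patches = F.grid_sample(frames, grid, align_corners=False)
        return feat, patches                      # global + local information
```
Because the crop is a bilinear read-out rather than a hard index, the selection strategy can be trained end to end with the downstream recognition loss.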
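For the alignment stage the report reuses TA3N, whose core mechanism is domain-adversarial training through a gradient reversal layer. The following is a minimal sketch of that mechanism with a toy discriminator; the DomainHead name and sizes are invented for illustration, not TA3N's actual architecture.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, negated (scaled)
    gradient on the backward pass, the core of TA3N-style adversarial
    feature alignment."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainHead(nn.Module):
    """Toy domain discriminator: predicts source vs. target from a feature
    vector; the reversed gradient pushes the features to confuse it."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))

# Usage sketch: global/patch features from both domains, labelled by domain.
feats = torch.randn(8, 128)                       # toy concatenated features
domains = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = source, 1 = target
loss = F.cross_entropy(DomainHead(128)(feats), domains)
loss.backward()  # in a full pipeline the reversed gradient would reach the
                 # feature extractor, driving source/target alignment
```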
Related papers
- Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video
Grounding [59.599378814835205]
Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query.
We introduce a novel AMDA method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data.
arXiv Detail & Related papers (2023-12-21T07:49:27Z)
- PMI Sampler: Patch Similarity Guided Frame Selection for Aerial Action Recognition [52.78234467516168]
We introduce the concept of patch mutual information (PMI) score to quantify the motion bias between adjacent frames.
We present an adaptive frame selection strategy using a shifted leaky ReLU and a cumulative distribution function.
Our method achieves a relative improvement of 2.2 - 13.8% in top-1 accuracy on UAV-Human, 6.8% on NEC Drone, and 9.0% on Diving48 datasets.
arXiv Detail & Related papers (2023-04-14T00:01:11Z)
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Unsupervised Domain Adaptation for Video Transformers in Action Recognition [76.31442702219461]
We propose a simple and novel UDA approach for video action recognition.
Our approach builds a robust source model that better generalises to the target domain.
We report results on two video action recognition benchmarks for UDA.
arXiv Detail & Related papers (2022-07-26T12:17:39Z)
- Team VI-I2R Technical Report on EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2021 [6.614021153407064]
The EPIC-KITCHENS-100 dataset consists of daily kitchen activities focusing on the interaction between human hands and their surrounding objects.
It is very challenging to accurately recognize these fine-grained activities, due to the presence of distracting objects and visually similar action classes.
We propose to learn hand-centric features by leveraging the hand bounding box information for UDA on fine-grained action recognition.
Our submission achieved the 1st place in terms of top-1 action recognition accuracy, using only RGB and optical flow modalities as input.
arXiv Detail & Related papers (2022-06-03T07:37:48Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Local-Global Associative Frame Assemble in Video Re-ID [57.7470971197962]
Noisy and unrepresentative frames in automatically generated object bounding boxes from video sequences cause challenges in learning discriminative representations in video re-identification (Re-ID).
Most existing methods tackle this problem by assessing the importance of video frames according to either their local part alignments or global appearance correlations separately.
In this work, we explore jointly both local alignments and global correlations with further consideration of their mutual promotion/reinforcement.
arXiv Detail & Related papers (2021-10-22T19:07:39Z)
- Learning Cross-modal Contrastive Features for Video Domain Adaptation [138.75196499580804]
We propose a unified framework for video domain adaptation, which simultaneously regularizes cross-modal and cross-domain feature representations.
Specifically, we treat each modality in a domain as a view and leverage the contrastive learning technique with properly designed sampling strategies.
arXiv Detail & Related papers (2021-08-26T18:14:18Z)
- DRIV100: In-The-Wild Multi-Domain Dataset and Evaluation for Real-World Domain Adaptation of Semantic Segmentation [9.984696742463628]
This work presents a new multi-domain dataset, DRIV100, for benchmarking domain adaptation techniques on in-the-wild road-scene videos collected from the Internet.
The dataset consists of pixel-level annotations for 100 videos selected to cover diverse scenes/domains based on two criteria: human subjective judgment and an anomaly score computed using an existing road-scene dataset.
arXiv Detail & Related papers (2021-01-30T04:43:22Z)