SOAR: Scene-debiasing Open-set Action Recognition
- URL: http://arxiv.org/abs/2309.01265v1
- Date: Sun, 3 Sep 2023 20:20:48 GMT
- Title: SOAR: Scene-debiasing Open-set Action Recognition
- Authors: Yuanhao Zhai, Ziyi Liu, Zhenyu Wu, Yi Wu, Chunluan Zhou, David
Doermann, Junsong Yuan, Gang Hua
- Abstract summary: We propose Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module.
The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning.
The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information.
- Score: 81.8198917049666
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep learning models have a risk of utilizing spurious clues to make
predictions, such as recognizing actions based on the background scene. This
issue can severely degrade the open-set action recognition performance when the
testing samples have different scene distributions from the training samples.
To mitigate this problem, we propose a novel method, called Scene-debiasing
Open-set Action Recognition (SOAR), which features an adversarial scene
reconstruction module and an adaptive adversarial scene classification module.
The former prevents the decoder from reconstructing the video background given
video features, and thus helps reduce the background information in feature
learning. The latter aims to confuse scene type classification given video
features, with a specific emphasis on the action foreground, and helps to learn
scene-invariant information. In addition, we design an experiment to quantify
the scene bias. The results indicate that the current open-set action
recognizers are biased toward the scene, and our proposed SOAR method better
mitigates such bias. Furthermore, our extensive experiments demonstrate that
our method outperforms state-of-the-art methods, and the ablation studies
confirm the effectiveness of our proposed modules.
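To make the two adversarial modules concrete, here is a minimal PyTorch-style sketch of feature-level debiasing via gradient reversal. It is an illustrative reconstruction under stated assumptions, not the authors' released code; all names (GradReverse, SceneDebiasedModel, recon_decoder, scene_head) are hypothetical.

```python
# Minimal sketch of adversarial scene debiasing with gradient reversal.
# Assumptions: a video backbone producing (B, feat_dim) features and a
# decoder producing a background reconstruction; names are illustrative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lambd * grad_out, None

class SceneDebiasedModel(nn.Module):
    def __init__(self, backbone, decoder, feat_dim, num_actions, num_scenes):
        super().__init__()
        self.backbone = backbone        # video encoder -> (B, feat_dim)
        self.recon_decoder = decoder    # tries to rebuild the static background
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.scene_head = nn.Linear(feat_dim, num_scenes)

    def forward(self, video, lambd=1.0):
        feat = self.backbone(video)
        action_logits = self.action_head(feat)          # main task
        # Adversary 1: background reconstruction from the reversed feature;
        # the reversed gradient pushes the backbone to drop background cues.
        bg_recon = self.recon_decoder(GradReverse.apply(feat, lambd))
        # Adversary 2: scene classification from the reversed feature;
        # the reversed gradient pushes the backbone toward scene invariance.
        scene_logits = self.scene_head(GradReverse.apply(feat, lambd))
        return action_logits, bg_recon, scene_logits

# Training losses (not shown): CE(action_logits, y) + MSE(bg_recon, background)
# + CE(scene_logits, scene_label); gradient reversal turns the last two into
# adversarial objectives for the backbone while each head minimizes its own loss.
```

Gradient reversal lets both adversaries train jointly: each auxiliary head minimizes its own loss, while the reversed gradient drives the backbone to make background reconstruction and scene classification fail.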
Related papers
- Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection [32.843848754881364]
Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos.
Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories.
We propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models.
arXiv Detail & Related papers (2024-11-17T00:39:59Z)
- Cutup and Detect: Human Fall Detection on Cutup Untrimmed Videos Using a Large Foundational Video Understanding Model [0.0]
This work explores the performance of a large video understanding foundation model on the downstream task of human fall detection on untrimmed video.
A method for temporal action localization that relies on a simple cutup of untrimmed videos is demonstrated.
The results are promising for real-time application, and falls are detected at the video level with a state-of-the-art F1 score of 0.96 on the HQFSD dataset.
arXiv Detail & Related papers (2024-01-29T16:37:00Z)
- DEVIAS: Learning Disentangled Video Representations of Action and Scene [3.336126457178601]
Video recognition models often learn scene-biased action representation due to the spurious correlation between actions and scenes in the training data.
We propose a disentangling encoder-decoder architecture to learn disentangled action and scene representations with a single model.
We rigorously validate the proposed method on the UCF-101, Kinetics-400, and HVU datasets for seen action-scene combinations, and on the SCUBA, HAT, and HVU datasets for unseen action-scene combinations.
arXiv Detail & Related papers (2023-11-30T18:58:44Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage, with one stage for person bounding-box generation and the other for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Simplifying Open-Set Video Domain Adaptation with Contrastive Learning [16.72734794723157]
Unsupervised video domain adaptation methods have been proposed to adapt a predictive model from a labelled dataset to an unlabelled dataset.
We address a more realistic scenario, called open-set unsupervised video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source.
We propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data (a toy sketch of such a loss appears after this list).
arXiv Detail & Related papers (2023-01-09T13:16:50Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Evidential Deep Learning for Open Set Action Recognition [36.350348194248014]
We formulate the action recognition problem from the evidential deep learning (EDL) perspective (a toy sketch of EDL-style open-set scoring appears after this list).
We propose a plug-and-play module to debias the learned representation through contrastive learning.
arXiv Detail & Related papers (2021-07-21T15:45:37Z)
- Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction [61.17654438176999]
We propose a novel and robust unsupervised video anomaly detection method by frame prediction with proper design.
Our proposed method obtains a frame-level AUROC of 88.3% on the CUHK Avenue dataset.
arXiv Detail & Related papers (2020-11-05T11:34:12Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
- Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos [82.02074241700728]
In this paper, we present a spatio-temporal action recognition model that is trained with only video-level labels.
Our method leverages per-frame person detectors that have been trained on large image datasets, within a Multiple Instance Learning framework.
We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid.
arXiv Detail & Related papers (2020-07-21T10:45:05Z)
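The OUVDA entry above mentions a video-oriented temporal contrastive loss; the following toy sketch shows one common way such a loss is built, assuming a symmetric InfoNCE over two clips sampled from each video. It is an assumption-laden illustration, not that paper's implementation.

```python
# Toy temporal contrastive loss: two clips from the same video form a
# positive pair; clips from other videos in the batch act as negatives.
# Illustrative only; not the OUVDA paper's code.
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) embeddings of two temporal clips per video;
    row i of z_a and z_b come from the same video (positive pair)."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric InfoNCE: each clip must match its own video's other clip.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```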
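Similarly, for the evidential deep learning entry, the sketch below shows the standard EDL recipe for open-set scoring (non-negative evidence, Dirichlet concentration, vacuity as the unknown score), not that paper's specific module.

```python
# Toy EDL-style open-set scoring: map logits to evidence, form a Dirichlet,
# and use vacuity u = K / S as the "unknown" score. Illustrative only.
import torch
import torch.nn.functional as F

def edl_unknown_score(logits):
    """logits: (B, K) raw class scores; returns per-sample vacuity in (0, 1],
    where larger values indicate likely unknown (open-set) samples."""
    evidence = F.softplus(logits)     # non-negative evidence per class
    alpha = evidence + 1.0            # Dirichlet concentration parameters
    strength = alpha.sum(dim=1)       # total evidence S = sum_k alpha_k
    k = logits.size(1)                # number of known classes K
    return k / strength               # vacuity: high when evidence is scarce
```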