STMixer: A One-Stage Sparse Action Detector
- URL: http://arxiv.org/abs/2404.09842v1
- Date: Mon, 15 Apr 2024 14:52:02 GMT
- Title: STMixer: A One-Stage Sparse Action Detector
- Authors: Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang
- Abstract summary: We propose two core designs for a more flexible one-stage action detector.
First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain.
Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding.
- Score: 43.62159663367588
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and the feature sampling is constrained inside the box, failing to effectively leverage richer context information outside. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, thus suffering from the issues of inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.
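The two core designs lend themselves to a compact sketch. Below is a minimal PyTorch illustration of (1) query-based adaptive sampling over a spatio-temporal feature volume via `grid_sample` and (2) decoupled mixing along the spatial and temporal dimensions; all module names, shapes, and hyper-parameters are illustrative assumptions, and the static linear mixers stand in for the paper's dynamic, query-conditioned mixing. This is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSTSampler(nn.Module):
    """Each query predicts (t, x, y) sampling points over the whole video volume."""
    def __init__(self, dim=256, num_frames=4, points_per_frame=8):
        super().__init__()
        self.T, self.P = num_frames, points_per_frame
        self.offset = nn.Linear(dim, num_frames * points_per_frame * 3)

    def forward(self, queries, video_feat):
        # queries: (B, Q, C); video_feat: (B, C, T, H, W)
        B, Q, _ = queries.shape
        # Normalized sampling locations in [-1, 1], one (x, y, t) triple per point.
        grid = torch.tanh(self.offset(queries)).view(B, Q, self.T * self.P, 1, 3)
        # 5-D grid_sample: grid (B, D', H', W', 3) -> output (B, C, D', H', W').
        feats = F.grid_sample(video_feat, grid, align_corners=False)
        feats = feats.squeeze(-1).permute(0, 2, 3, 1)   # (B, Q, T*P, C)
        return feats.view(B, Q, self.T, self.P, -1)     # (B, Q, T, P, C)

class DecoupledMixer(nn.Module):
    """Mixes sampled features along the spatial points, then along the frames."""
    def __init__(self, dim=256, num_frames=4, points_per_frame=8):
        super().__init__()
        self.spatial_mix = nn.Linear(points_per_frame, points_per_frame)
        self.temporal_mix = nn.Linear(num_frames, num_frames)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, Q, T, P, C) sampled features from AdaptiveSTSampler
        x = self.spatial_mix(x.transpose(-1, -2)).transpose(-1, -2)  # mix over P
        x = self.temporal_mix(x.movedim(2, -1)).movedim(-1, 2)       # mix over T
        return self.proj(x.mean(dim=(2, 3)))                         # (B, Q, C)

# Usage: decode 10 query embeddings against an 8-frame feature volume.
video = torch.randn(2, 256, 8, 16, 16)
queries = torch.randn(2, 10, 256)
decoded = DecoupledMixer()(AdaptiveSTSampler()(queries, video))  # (2, 10, 256)
```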
Related papers
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM).
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
- MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection [36.478530086163744]
We propose a novel Mutually optimizing pre-training framework for remote sensing object Detection, dubbed as MutDet.
MutDet fuses the object embeddings and detector features bidirectionally in the last encoder layer, enhancing their information interaction.
Experiments on various settings show new state-of-the-art transfer performance.
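As a rough illustration of what bidirectional fusion in the last encoder layer could look like, the sketch below exchanges information between object embeddings and detector features with two cross-attention passes; the mechanism and module layout are assumptions, not MutDet's actual code.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Two cross-attention passes: embeddings attend to features and vice versa."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.emb_from_feat = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feat_from_emb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj_emb, det_feat):
        # obj_emb: (B, N, C) object embeddings; det_feat: (B, L, C) encoder tokens
        emb_upd, _ = self.emb_from_feat(obj_emb, det_feat, det_feat)
        feat_upd, _ = self.feat_from_emb(det_feat, obj_emb, obj_emb)
        return obj_emb + emb_upd, det_feat + feat_upd
```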
arXiv Detail & Related papers (2024-07-13T15:28:15Z)
- STMixer: A One-Stage Sparse Action Detector [48.0614066856134]
We propose a new one-stage action detector, termed STMixer.
We present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative video features.
We obtain the state-of-the-art results on the datasets of AVA, UCF101-24, and JHMDB.
arXiv Detail & Related papers (2023-03-28T10:47:06Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize the high-frequency noises for face forgery detection.
The first is the multi-scale high-frequency feature extraction module that extracts high-frequency noises at multiple scales.
The second is the residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate more on forgery traces from a new perspective.
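As a hedged sketch of multi-scale high-frequency extraction, the snippet below computes the high-pass residual (image minus its Gaussian blur) at several blur scales; the paper's actual filters may differ, so treat the kernel choice as an assumption.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size, sigma):
    # Separable 1-D Gaussian, outer-producted into a 2-D kernel.
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    g = g / g.sum()
    return torch.outer(g, g)

def multi_scale_high_freq(x, sigmas=(0.8, 1.6, 3.2)):
    # x: (B, 3, H, W); returns one high-frequency residual per scale.
    residuals = []
    for sigma in sigmas:
        k = 2 * round(3 * sigma) + 1  # odd kernel covering ~3 sigma
        kernel = gaussian_kernel(k, sigma).to(x).expand(x.shape[1], 1, k, k)
        blur = F.conv2d(x, kernel, padding=k // 2, groups=x.shape[1])
        residuals.append(x - blur)  # high-frequency noise at this scale
    return residuals
```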
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection [8.39479809973967]
Few-shot object detection (FSOD) aims at learning a detector that can fast adapt to previously unseen objects with scarce examples.
Existing methods solve this problem by performing subtasks of classification and localization utilizing a shared component.
We argue that a general few-shot detector should explicitly decompose the two subtasks and leverage information from both of them to enhance feature representations.
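One way to read "explicit decomposition plus cross-subtask information" is a dual-branch head whose outputs each see both branches' features; the sketch below is an illustrative assumption, not AFD-Net's architecture.

```python
import torch
import torch.nn as nn

class DualBranchHead(nn.Module):
    """Separate classification/localization branches whose features are shared."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.cls_branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.loc_branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.cls_out = nn.Linear(2 * dim, num_classes)  # sees both subtasks
        self.loc_out = nn.Linear(2 * dim, 4)

    def forward(self, feat):
        # feat: (N, dim) per-proposal features
        joint = torch.cat([self.cls_branch(feat), self.loc_branch(feat)], dim=-1)
        return self.cls_out(joint), self.loc_out(joint)
```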
arXiv Detail & Related papers (2020-11-30T10:21:32Z)
- Joint Detection and Tracking in Videos with Identification Features [36.55599286568541]
We propose the first joint optimization of detection, tracking and re-identification features for videos.
Our method reaches the state of the art on MOT; it ranks 1st in the UA-DETRAC'18 tracking challenge among online trackers, and 3rd overall.
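To make "joint optimization" concrete, here is a toy multi-task objective that trains detection and re-identification heads over a shared feature; the heads, targets, and loss weights are placeholders, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointHead(nn.Module):
    """A shared feature feeds both a detection head and a re-ID head."""
    def __init__(self, dim=256, num_ids=1000):
        super().__init__()
        self.box = nn.Linear(dim, 4)         # box regression
        self.reid = nn.Linear(dim, num_ids)  # identity classification for re-ID

    def forward(self, feat):                 # feat: (N, dim) per-proposal features
        return self.box(feat), self.reid(feat)

def joint_loss(box_pred, id_pred, box_tgt, id_tgt, w_box=1.0, w_id=1.0):
    # One combined objective so detection and re-ID features co-adapt.
    return (w_box * F.smooth_l1_loss(box_pred, box_tgt)
            + w_id * F.cross_entropy(id_pred, id_tgt))
```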
arXiv Detail & Related papers (2020-05-21T21:06:40Z)
- Spatio-Temporal Action Detection with Multi-Object Interaction [127.85524354900494]
In this paper, we study the spatio-temporal action detection problem with multi-object interaction.
We introduce a new dataset that is spatially annotated with action tubes containing multi-object interactions.
We propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously.
arXiv Detail & Related papers (2020-04-01T00:54:56Z)
- Actions as Moving Points [66.21507857877756]
We present a conceptually simple, efficient, and more precise action tubelet detection framework, termed MovingCenter Detector (MOC-detector).
Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches.
Our MOC-detector outperforms the existing state-of-the-art methods for both metrics of frame-mAP and video-mAP on the JHMDB and UCF101-24 datasets.
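Per the MOC-detector paper, the three branches are a center branch (where the action is and which class), a movement branch (how the center moves across frames), and a box branch (per-frame box size). A minimal sketch, with channel sizes and layouts as assumptions:

```python
import torch
import torch.nn as nn

class MOCHeads(nn.Module):
    """Center, movement, and box branches over K stacked frame features."""
    def __init__(self, in_ch=64, num_classes=24, num_frames=7):
        super().__init__()
        self.center = nn.Conv2d(in_ch * num_frames, num_classes, 1)       # class heatmaps
        self.movement = nn.Conv2d(in_ch * num_frames, 2 * num_frames, 1)  # (dx, dy) per frame
        self.box = nn.Conv2d(in_ch, 2, 1)                                 # (w, h) per frame

    def forward(self, feats):
        # feats: (B, K, C, H, W) frame-level features
        B, K, C, H, W = feats.shape
        stacked = feats.reshape(B, K * C, H, W)
        heatmap = torch.sigmoid(self.center(stacked))   # per-class center likelihood
        movement = self.movement(stacked)               # offsets linking frames into tubelets
        boxes = self.box(feats.reshape(B * K, C, H, W)).reshape(B, K, 2, H, W)
        return heatmap, movement, boxes
```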
arXiv Detail & Related papers (2020-01-14T03:29:44Z)