STMixer: A One-Stage Sparse Action Detector
- URL: http://arxiv.org/abs/2303.15879v1
- Date: Tue, 28 Mar 2023 10:47:06 GMT
- Title: STMixer: A One-Stage Sparse Action Detector
- Authors: Tao Wu and Mengqi Cao and Ziteng Gao and Gangshan Wu and Limin Wang
- Abstract summary: We propose a new one-stage action detector, termed STMixer.
We present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative video features.
We obtain the state-of-the-art results on the datasets of AVA, UCF101-24, and JHMDB.
- Score: 48.0614066856134
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditional video action detectors typically adopt the two-stage pipeline,
where a person detector is first employed to generate actor boxes and then 3D
RoIAlign is used to extract actor-specific features for classification. This
detection paradigm requires multi-stage training and inference, and cannot
capture context information outside the bounding box. Recently, a few
query-based action detectors are proposed to predict action instances in an
end-to-end manner. However, they still lack adaptability in feature sampling
and decoding, thus suffering from the issues of inferior performance or slower
convergence. In this paper, we propose a new one-stage sparse action detector,
termed STMixer. STMixer is based on two core designs. First, we present a
query-based adaptive feature sampling module, which endows our STMixer with the
flexibility of mining a set of discriminative features from the entire
spatiotemporal domain. Second, we devise a dual-branch feature mixing module,
which allows our STMixer to dynamically attend to and mix video features along
the spatial and the temporal dimension respectively for better feature
decoding. Coupling these two designs with a video backbone yields an efficient
end-to-end action detector. Without bells and whistles, our STMixer obtains the
state-of-the-art results on the datasets of AVA, UCF101-24, and JHMDB.
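As a rough illustration of the two core designs (not the paper's actual implementation), the idea can be sketched with NumPy: a query vector is projected to a set of sampling locations over the whole (T, H, W) feature volume, and the sampled point set is then mixed along two separate axes before fusion. All projection matrices below are random stand-ins for the learned layers, and the exact mixing axes are simplified assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy spatiotemporal feature volume: (T frames, H, W, C channels).
T, H, W, C = 4, 8, 8, 16
features = rng.standard_normal((T, H, W, C))

def adaptive_sample(features, query, num_points=8):
    """Query-based adaptive sampling (sketch): the query is projected to
    (t, y, x) locations and features are gathered at those points from the
    entire spatiotemporal volume -- no actor box or RoIAlign is needed."""
    T, H, W, C = features.shape
    proj = rng.standard_normal((query.size, num_points * 3))  # stand-in for a learned layer
    offsets = query @ proj                                    # (num_points * 3,)
    # Map raw offsets to valid integer grid locations.
    locs = (np.abs(offsets).reshape(num_points, 3) * [T, H, W]).astype(int)
    locs %= [T, H, W]
    return features[locs[:, 0], locs[:, 1], locs[:, 2]]      # (num_points, C)

def dual_branch_mix(sampled):
    """Dual-branch mixing (sketch): one branch mixes across the sampled
    points, the other across channels; the branches are fused by summation."""
    P, C = sampled.shape
    point_mixed = rng.standard_normal((P, P)) @ sampled    # mix over points
    channel_mixed = sampled @ rng.standard_normal((C, C))  # mix over channels
    return point_mixed + channel_mixed

query = rng.standard_normal(32)
sampled = adaptive_sample(features, query)   # (8, 16) sampled point features
decoded = dual_branch_mix(sampled)           # (8, 16) mixed features
```

The point of the sketch is the data flow: sampling locations depend on the query rather than on a fixed box, and decoding mixes the sampled set along two complementary axes.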
Related papers
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM).
arXiv Detail & Related papers (2024-08-05T08:35:59Z) - MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection [36.478530086163744]
We propose a novel Mutually optimizing pre-training framework for remote sensing object Detection, dubbed MutDet.
MutDet fuses the object embeddings and detector features bidirectionally in the last encoder layer, enhancing their information interaction.
Experiments on various settings show new state-of-the-art transfer performance.
arXiv Detail & Related papers (2024-07-13T15:28:15Z) - Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
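The general idea of training on masked trajectory inputs can be sketched as below; the mask ratio, the zero fill, and the shapes are illustrative assumptions, not UniTraj's actual GSM scheme.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy multi-agent trajectories: (agents, timesteps, xy coordinates).
traj = rng.standard_normal((5, 20, 2))

def mask_trajectories(traj, mask_ratio=0.3, rng=rng):
    """Randomly mask agent-timestep positions so a generative model can be
    trained to reconstruct them (a stand-in for the paper's masking module)."""
    mask = rng.random(traj.shape[:2]) < mask_ratio  # (agents, timesteps)
    masked = traj.copy()
    masked[mask] = 0.0  # masked positions zeroed; the model must infill them
    return masked, mask

masked, mask = mask_trajectories(traj)
```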
arXiv Detail & Related papers (2024-05-27T22:15:23Z) - STMixer: A One-Stage Sparse Action Detector [43.62159663367588]
We propose two core designs for a more flexible one-stage action detector.
First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of features from the entire spatiotemporal domain.
Second, we devise a dual-branch feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding.
arXiv Detail & Related papers (2024-04-15T14:52:02Z) - SODFormer: Streaming Object Detection with Transformer Using Events and Frames [31.293847706713052]
The DAVIS camera, streaming two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges.
We propose SODFormer, a novel streaming object detector that first integrates events and frames to continuously detect objects in an asynchronous manner.
arXiv Detail & Related papers (2023-08-08T04:53:52Z) - MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z) - Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution.
Our model outperforms the accuracy of state-of-the-art models on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
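The parameter saving from replacing each gate's convolution with a depthwise separable one can be checked with a quick count (bias terms omitted; the 3x3 kernel and 64-channel gate below are illustrative choices, not the paper's exact configuration):

```python
def conv_params(k, c_in, c_out):
    """Parameters in a standard k x k 2D convolution (bias omitted)."""
    return k * k * c_in * c_out

def sepconv_params(k, c_in, c_out):
    """Depthwise separable convolution: a k x k depthwise conv per input
    channel, followed by a 1x1 pointwise conv (bias omitted)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# One hypothetical ConvLSTM gate with 3x3 kernels and 64 channels in and out.
standard = conv_params(3, 64, 64)      # 3*3*64*64 = 36864
separable = sepconv_params(3, 64, 64)  # 576 + 4096 = 4672
```

With these numbers the separable variant uses roughly one eighth of the parameters of the standard gate, which is where the architecture's efficiency comes from.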
arXiv Detail & Related papers (2021-02-21T12:01:48Z) - Object Detection Made Simpler by Eliminating Heuristic NMS [70.93004137521946]
We show a simple NMS-free, end-to-end object detection framework.
We attain on par or even improved detection accuracy compared with the original one-stage detector.
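For reference, the heuristic NMS step that such end-to-end frameworks eliminate is the classic greedy suppression procedure; a minimal sketch (not the paper's code, thresholds and box values are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the two heavily overlapping boxes collapse to one
```

NMS-free detectors replace this hand-tuned post-processing with a learned one-to-one assignment between predictions and objects.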
arXiv Detail & Related papers (2021-01-28T02:38:29Z) - AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection [8.39479809973967]
Few-shot object detection (FSOD) aims at learning a detector that can fast adapt to previously unseen objects with scarce examples.
Existing methods solve this problem by performing subtasks of classification and localization utilizing a shared component.
We argue that a general few-shot detector should consider the explicit decomposition of the two subtasks, as well as leverage information from both of them to enhance feature representations.
arXiv Detail & Related papers (2020-11-30T10:21:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.