Context-Aware RCNN: A Baseline for Action Detection in Videos
- URL: http://arxiv.org/abs/2007.09861v1
- Date: Mon, 20 Jul 2020 03:11:48 GMT
- Title: Context-Aware RCNN: A Baseline for Action Detection in Videos
- Authors: Jianchao Wu, Zhanghui Kuang, Limin Wang, Wayne Zhang, Gangshan Wu
- Abstract summary: We first empirically find that recognition accuracy is highly correlated with the bounding box size of an actor.
We revisit RCNN for actor-centric action recognition via cropping and resizing image patches around actors.
We also find that slightly expanding actor bounding boxes and fusing the context features can further boost performance.
- Score: 66.16989365280938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video action detection approaches usually conduct actor-centric action
recognition over RoI-pooled features, following the standard pipeline of
Faster-RCNN. In this work, we first empirically find that the recognition
accuracy is highly correlated with the bounding box size of an actor, and thus
higher actor resolution contributes to better performance. However, video
models require dense sampling in time to achieve accurate recognition. To fit
in GPU memory, the frames fed to the backbone network must be kept at low
resolution, resulting in a coarse feature map at the RoI-Pooling layer. We
therefore revisit RCNN for actor-centric action recognition, cropping and
resizing image patches around actors before feature extraction with an I3D
deep network. Moreover, we find that slightly expanding actor bounding boxes
and fusing the context features can further boost performance. Consequently,
we develop a surprisingly effective baseline (Context-Aware RCNN) that
achieves new state-of-the-art results on two challenging action detection
benchmarks, AVA and JHMDB. Our observations challenge the conventional wisdom
of the RoI-Pooling based pipeline and encourage researchers to rethink the
importance of resolution in actor-centric action recognition. Our approach can
serve as a strong baseline for video action detection and is expected to
inspire new ideas in this field. The code is available at
\url{https://github.com/MCG-NJU/CRCNN-Action}.
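As a rough illustration of the pipeline the abstract describes, the sketch below crops and resizes a slightly expanded actor patch before feature extraction, then fuses the actor feature with a whole-frame context feature. It is a minimal PyTorch sketch: `backbone` (e.g., a pretrained I3D returning a clip-level embedding) and `head` are hypothetical stand-ins, and the expansion ratio and concatenation-based fusion are assumptions, not necessarily the paper's exact design.
```python
import torch
import torch.nn.functional as F

def expand_box(box, ratio, img_w, img_h):
    """Slightly enlarge an actor box (x1, y1, x2, y2) so the crop keeps
    some surrounding context, clamped to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * (1 + ratio), (y2 - y1) * (1 + ratio)
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

def actor_clip_feature(clip, box, backbone, crop_size=224, expand_ratio=0.1):
    """Crop and resize the (expanded) actor patch from every frame, then run
    the video backbone on the high-resolution patch clip, instead of
    RoI-pooling a coarse feature map as in the Faster-RCNN-style pipeline.
    `clip` is (T, C, H, W); `backbone` maps (1, C, T, H', W') -> (1, D)."""
    T, C, H, W = clip.shape
    x1, y1, x2, y2 = expand_box(box, expand_ratio, W, H)
    patch = clip[:, :, int(y1):int(y2), int(x1):int(x2)]
    patch = F.interpolate(patch, size=(crop_size, crop_size),
                          mode="bilinear", align_corners=False)
    return backbone(patch.permute(1, 0, 2, 3).unsqueeze(0))

def classify_action(clip, box, backbone, head, crop_size=224):
    """Fuse the actor feature with a whole-frame context feature (simple
    concatenation here) before the action classifier."""
    T, C, H, W = clip.shape
    actor = actor_clip_feature(clip, box, backbone, crop_size)
    scene = actor_clip_feature(clip, (0, 0, W, H), backbone, crop_size,
                               expand_ratio=0.0)
    return head(torch.cat([actor, scene], dim=-1))
```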
Related papers
- Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection [41.35861722481721]
Deepfake threats to society and cybersecurity have provoked significant public apprehension.
This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL).
TALL transforms a video clip into a pre-defined layout that preserves spatial and temporal dependencies.
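As a rough sketch of the layout idea (the exact grid size and any masking used by TALL are assumptions here), a clip of T frames can be tiled into a single image so that a 2D backbone sees temporal neighbours side by side:
```python
import torch

def thumbnail_layout(clip: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Tile T = rows*cols frames of shape (T, C, H, W) into a single
    (C, rows*H, cols*W) image, preserving spatial layout within each
    thumbnail and temporal order across the grid."""
    T, C, H, W = clip.shape
    assert T == rows * cols, "clip length must fill the grid"
    grid = clip.reshape(rows, cols, C, H, W)
    grid = grid.permute(2, 0, 3, 1, 4)          # (C, rows, H, cols, W)
    return grid.reshape(C, rows * H, cols * W)

# e.g. four 112x112 frames become one 224x224 thumbnail image
layout = thumbnail_layout(torch.randn(4, 3, 112, 112), rows=2, cols=2)
```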
arXiv Detail & Related papers (2024-03-15T12:48:44Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
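The summary names "convolutional pooling" without detail; one plausible reading, sketched below as an assumption rather than the authors' module, is a strided depthwise convolution that downsamples the token map between attention stages in place of parameter-free pooling:
```python
import torch.nn as nn

class ConvPool(nn.Module):
    """Strided depthwise convolution as a learnable pooling step for a
    (B, C, H, W) token map. A hypothetical sketch of 'convolutional
    pooling', not the paper's exact design."""
    def __init__(self, dim: int, stride: int = 2):
        super().__init__()
        self.pool = nn.Conv2d(dim, dim, kernel_size=3, stride=stride,
                              padding=1, groups=dim)  # depthwise
    def forward(self, x):
        return self.pool(x)  # halves H and W with stride=2
```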
arXiv Detail & Related papers (2022-09-12T15:05:41Z)
- An Empirical Study of Remote Sensing Pretraining [117.90699699469639]
We conduct an empirical study of remote sensing pretraining (RSP) on aerial images.
RSP can help deliver distinctive performance in scene recognition tasks.
RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, but it may still suffer from task discrepancies.
arXiv Detail & Related papers (2022-04-06T13:38:11Z)
- Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Graph Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
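A minimal sketch of the kind of graph convolution over scene context this suggests, assuming nodes are actor/object features and the adjacency is a learned similarity-based attention map (the paper's exact formulation may differ); returning the attention map is what makes the learned context inspectable:
```python
import torch
import torch.nn as nn

class ContextGraphConv(nn.Module):
    """One GCN-style layer over N context nodes (actors/objects), each a
    `dim`-d feature. The adjacency is a softmax similarity, returned so
    it can be visualized as an attention map. A hypothetical sketch."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes):               # nodes: (N, dim)
        attn = torch.softmax(nodes @ nodes.t() / nodes.size(-1) ** 0.5, dim=-1)
        return torch.relu(self.proj(attn @ nodes)), attn
```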
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
- We don't Need Thousand Proposals: Single Shot Actor-Action Detection in Videos [0.0]
We propose SSA2D, a simple yet effective end-to-end deep network for actor-action detection in videos.
SSA2D is a unified network, which performs pixel-level joint actor-action detection in a single shot.
We evaluate the proposed method on the Actor-Action dataset (A2D) and Video Object Relation (VidOR) dataset.
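A minimal sketch of what such a single-shot, pixel-level joint head could look like (class counts and the architecture are assumptions for illustration, not SSA2D's actual design): every location gets both an actor label and an action label in one forward pass, with no proposal stage.
```python
import torch.nn as nn

class JointActorActionHead(nn.Module):
    """Proposal-free heads over a shared feature map: per-pixel actor
    logits and per-pixel action logits, predicted jointly. Class counts
    here are illustrative placeholders."""
    def __init__(self, in_ch: int, n_actors: int = 8, n_actions: int = 10):
        super().__init__()
        self.actor = nn.Conv2d(in_ch, n_actors, kernel_size=1)
        self.action = nn.Conv2d(in_ch, n_actions, kernel_size=1)

    def forward(self, feat):                 # feat: (B, in_ch, H, W)
        return self.actor(feat), self.action(feat)
```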
arXiv Detail & Related papers (2020-11-22T03:53:40Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
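In the spirit of the center-and-scale idea described above (a sketch under assumed conventions, not the authors' head), an anchor-free detector can predict a center heatmap and a per-location scale map instead of regressing anchor-based boxes:
```python
import torch.nn as nn

class CenterScaleHead(nn.Module):
    """Anchor-free detection head: a pedestrian-center heatmap plus a
    2-channel (log-height, log-width) scale map per feature location.
    Boxes are decoded from heatmap peaks and the scale at each peak."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.center = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.scale = nn.Conv2d(in_ch, 2, kernel_size=1)

    def forward(self, feat):                  # feat: (B, in_ch, H, W)
        return self.center(feat).sigmoid(), self.scale(feat)
```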
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
- Depthwise Non-local Module for Fast Salient Object Detection Using a Single Thread [136.2224792151324]
We propose a new deep learning algorithm for fast salient object detection.
The proposed algorithm achieves competitive accuracy and high inference efficiency simultaneously with a single CPU thread.
arXiv Detail & Related papers (2020-01-22T15:23:48Z)