We don't Need Thousand Proposals: Single Shot Actor-Action Detection in Videos
- URL: http://arxiv.org/abs/2011.10927v1
- Date: Sun, 22 Nov 2020 03:53:40 GMT
- Title: We don't Need Thousand Proposals: Single Shot Actor-Action Detection in Videos
- Authors: Aayush J Rana, Yogesh S Rawat
- Abstract summary: We propose SSA2D, a simple yet effective end-to-end deep network for actor-action detection in videos.
SSA2D is a unified network, which performs pixel level joint actor-action detection in a single-shot.
We evaluate the proposed method on the Actor-Action dataset (A2D) and Video Object Relation (VidOR) dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose SSA2D, a simple yet effective end-to-end deep network for
actor-action detection in videos. The existing methods take a top-down approach
based on region proposal networks (RPN), where the action is estimated from the
detected proposals followed by post-processing such as non-maximum suppression.
While effective in terms of performance, these methods scale poorly to dense
video scenes because of the high memory required for thousands of proposals. We
propose to solve this problem from a different perspective, where we do not need
any proposals. SSA2D is a unified network that performs pixel-level joint
actor-action detection in a single shot, where every pixel of the detected actor
is assigned an action label. SSA2D has two main advantages: 1) it is a fully
convolutional network that requires neither proposals nor post-processing,
making it both memory and time efficient; 2) it scales easily to dense video
scenes, as its memory requirement is independent of the number of actors present
in the scene. We evaluate the proposed method on the Actor-Action (A2D) and
Video Object Relation (VidOR) datasets, demonstrating its effectiveness for
detecting multiple actors and their actions in a video. SSA2D is 11x faster
during inference, with comparable (sometimes better) performance and fewer
network parameters than prior works.
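To make the proposal-free, single-shot idea concrete, below is a minimal sketch of a fully convolutional network that predicts a per-pixel actor-class map and a per-pixel action-class map from a video clip in one forward pass, with no proposals and no NMS. This is an illustration only, not the SSA2D architecture: the backbone depth, channel widths, class counts, and names are assumptions.

```python
# Minimal sketch of proposal-free, pixel-level actor-action prediction.
# Illustrative only; layer sizes, class counts, and names are assumptions,
# not the SSA2D architecture described in the paper.
import torch
import torch.nn as nn

class PixelActorAction(nn.Module):
    def __init__(self, num_actors=8, num_actions=10):
        super().__init__()
        # Small 3D-conv backbone over a clip of shape (B, 3, T, H, W).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # Two fully convolutional heads: a per-pixel actor classifier and a
        # per-pixel action classifier ("+1" reserves a background class).
        self.actor_head = nn.Conv3d(64, num_actors + 1, kernel_size=1)
        self.action_head = nn.Conv3d(64, num_actions + 1, kernel_size=1)

    def forward(self, clip):
        feat = self.backbone(clip)
        # Dense per-pixel logits; no proposals or NMS are involved, so memory
        # does not grow with the number of actors in the scene.
        return self.actor_head(feat), self.action_head(feat)

if __name__ == "__main__":
    model = PixelActorAction()
    clip = torch.randn(1, 3, 4, 64, 64)        # (batch, RGB, frames, H, W)
    actor_logits, action_logits = model(clip)  # each: (1, C, 4, 64, 64)
    actor_map = actor_logits.argmax(dim=1)     # per-pixel actor label
    action_map = action_logits.argmax(dim=1)   # per-pixel action label
    print(actor_map.shape, action_map.shape)
```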
Related papers
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework based on vision transformers (ViTs) for efficient video action detection.
In a video clip, we keep the tokens from its keyframe while preserving tokens relevant to actor motions from other frames.
We then refine scene context by leveraging the remaining tokens to better recognize actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding such as temporal action detection (TAD) often suffers from the pain of huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z) - Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
arXiv Detail & Related papers (2021-05-07T13:24:47Z) - Zeus: Efficiently Localizing Actions in Videos using Reinforcement
Learning [8.00133208459188]
We present Zeus, a video analytics system tailored for answering action queries.
Zeus trains an agent that learns to adaptively modify the input video segments to an action classification network.
Zeus is capable of answering the query at a user-specified target accuracy, using a query optimizer that trains the agent based on an accuracy-aware reward function.
arXiv Detail & Related papers (2021-04-06T16:38:31Z) - Decoupled and Memory-Reinforced Networks: Towards Effective Feature
Learning for One-Step Person Search [65.51181219410763]
One-step methods have been developed to handle pedestrian detection and identification sub-tasks using a single network.
There are two major challenges in the current one-step approaches.
We propose a decoupled and memory-reinforced network (DMRNet) to overcome these problems.
arXiv Detail & Related papers (2021-02-22T06:19:45Z) - Context-Aware RCNN: A Baseline for Action Detection in Videos [66.16989365280938]
We first empirically find the recognition accuracy is highly correlated with the bounding box size of an actor.
We revisit RCNN for actor-centric action recognition via cropping and resizing image patches around actors.
We find that expanding actor bounding boxes slightly and fusing the context features can further boost performance (a toy sketch of this expand-and-crop step follows the list).
arXiv Detail & Related papers (2020-07-20T03:11:48Z)
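The expand-and-crop step mentioned in the Context-Aware RCNN entry above is easy to illustrate. The snippet below is a hypothetical toy example, not the paper's implementation; the 10% margin, array shapes, and helper names are assumptions.

```python
# Hypothetical illustration of "expand the actor box, then crop the patch".
# The margin, shapes, and helper names are assumptions, not the
# Context-Aware RCNN implementation.
import numpy as np

def expand_box(x1, y1, x2, y2, img_w, img_h, ratio=0.1):
    """Grow a box by `ratio` of its size on each side, clipped to the image."""
    dw, dh = ratio * (x2 - x1), ratio * (y2 - y1)
    return (max(0, x1 - dw), max(0, y1 - dh),
            min(img_w, x2 + dw), min(img_h, y2 + dh))

def crop_actor(frame, box):
    """Crop an expanded actor patch from an HxWx3 frame."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    return frame[y1:y2, x1:x2]

if __name__ == "__main__":
    frame = np.zeros((240, 320, 3), dtype=np.uint8)  # dummy 320x240 frame
    box = expand_box(100, 60, 180, 200, 320, 240)    # actor box plus context
    patch = crop_actor(frame, box)
    print(patch.shape)  # the patch would then be resized and classified
```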