Zeus: Efficiently Localizing Actions in Videos using Reinforcement Learning
- URL: http://arxiv.org/abs/2104.06142v2
- Date: Mon, 19 Apr 2021 03:20:48 GMT
- Title: Zeus: Efficiently Localizing Actions in Videos using Reinforcement Learning
- Authors: Pramod Chunduri, Jaeho Bang, Yao Lu, Joy Arulraj
- Abstract summary: We present Zeus, a video analytics system tailored for answering action queries.
Zeus trains an agent that learns to adaptively modify the video segments fed to an action classification network.
Zeus answers the query at a user-specified target accuracy using a query optimizer that trains the agent with an accuracy-aware reward function.
- Score: 8.00133208459188
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Detection and localization of actions in videos is an important problem in
practice. A traffic analyst might be interested in studying the patterns in
which vehicles move at a given intersection. State-of-the-art video analytics
systems are unable to efficiently and effectively answer such action queries.
The reasons are threefold. First, action detection and localization tasks
require computationally expensive deep neural networks. Second, actions are
often rare events. Third, actions are spread across a sequence of frames, so
the entire sequence must be taken into context to answer the query
effectively. At the same time, the system must quickly skim through the
irrelevant parts of the video to answer the action query efficiently.
In this paper, we present Zeus, a video analytics system tailored for
answering action queries. We propose a novel technique for efficiently
answering these queries using a deep reinforcement learning agent. Zeus trains
an agent that learns to adaptively modify the input video segments to an action
classification network. The agent alters the input segments along three
dimensions -- sampling rate, segment length, and resolution. Besides
efficiency, Zeus is capable of answering the query at a user-specified target
accuracy using a query optimizer that trains the agent based on an
accuracy-aware reward function. Our evaluation of Zeus on a novel action
localization dataset shows that it outperforms the state-of-the-art frame- and
window-based techniques by up to 1.4x and 3x, respectively. Furthermore,
unlike the frame-based technique, Zeus satisfies the user-specified target
accuracy across all the queries, with up to 2x higher accuracy.
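To make this concrete, here is a minimal sketch (our illustration, not the authors' code) of the two mechanisms the abstract describes: a discrete action space over (sampling rate, segment length, resolution), and an accuracy-aware reward that trades prediction correctness against compute while penalizing drops below the user-specified target accuracy. All names, constants, and the exact reward shape are assumptions.
```python
import random
from dataclasses import dataclass

# The agent's discrete action space: each action is a triple of
# (sampling rate, segment length, resolution), the three dimensions
# named in the abstract. The values here are illustrative.
ACTIONS = [
    (rate, length, res)
    for rate in (1, 2, 4)        # keep every rate-th frame
    for length in (8, 16, 32)    # frames per classifier input segment
    for res in (112, 224)        # input resolution in pixels
]

@dataclass
class StepResult:
    prediction: int  # predicted action class for the segment
    cost: float      # proxy for compute spent on the segment

def classify(segment, rate, length, res):
    # Stand-in for the expensive action classification network; cost
    # grows with frames processed and quadratically with resolution.
    frames = min(length, len(segment)) // rate
    return StepResult(prediction=random.randint(0, 1),
                      cost=frames * (res / 224.0) ** 2)

def reward(correct, cost, running_acc, target_acc):
    # Accuracy-aware reward (our guess at its shape): reward correct
    # predictions, charge for compute, and add a penalty whenever the
    # running accuracy falls below the user-specified target.
    r = (1.0 if correct else -1.0) - 0.01 * cost
    if running_acc < target_acc:
        r -= 0.5  # steer the policy back toward accurate configurations
    return r
```
A policy trained on this action space would pick cheap configurations (coarse sampling, low resolution) to skim irrelevant video and reserve expensive ones for likely action regions; per the abstract, the query optimizer tunes training so that the accuracy constraint is satisfied.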
Related papers
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must contend with high across-frame variation in object appearance and diverse degradation in some frames.
Most contemporary aggregation methods are tailored to two-stage detectors and suffer from high computational costs.
This study presents a very simple yet potent feature selection and aggregation strategy that gains significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a hot research topic, driven by the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select the salient frames without awareness of the class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z)
- ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding tasks such as temporal action detection (TAD) often suffer from a huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- Temporal Query Networks for Fine-grained Video Understanding [88.9877174286279]
We cast fine-grained video understanding as a query-response mechanism, where each query addresses a particular question and has its own response label set.
We evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
arXiv Detail & Related papers (2021-04-19T17:58:48Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- We don't Need Thousand Proposals: Single Shot Actor-Action Detection in Videos [0.0]
We propose SSA2D, a simple yet effective end-to-end deep network for actor-action detection in videos.
SSA2D is a unified network that performs pixel-level joint actor-action detection in a single shot.
We evaluate the proposed method on the Actor-Action dataset (A2D) and Video Object Relation (VidOR) dataset.
arXiv Detail & Related papers (2020-11-22T03:53:40Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos [0.0]
ActionSpotter is a spotting algorithm that takes advantage of deep reinforcement learning to efficiently spot actions while adapting its video browsing speed; a sketch of this browsing idea appears after this list.
In particular, the spotting mean Average Precision on THUMOS14 is significantly improved from 59.7% to 65.6% while skipping 23% of the video.
arXiv Detail & Related papers (2020-04-15T09:36:37Z)
- Video Monitoring Queries [16.7214343633499]
We study the problem of interactive declarative query processing on video streams.
We introduce a set of approximate filters to speed up queries that involve objects of a specific type.
The filters quickly assess whether the query predicates hold before proceeding with further analysis of the frame.
arXiv Detail & Related papers (2020-02-24T20:53:35Z)
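Below is the sketch referenced in the ActionSpotter entry above: a minimal, hypothetical illustration (not the paper's code) of RL-driven video browsing, where a policy maps a cheap per-frame score to either an emit decision or a skip length, so low-score stretches are skimmed quickly. The function and the toy policy are assumptions.
```python
def browse(frames, policy, classifier):
    """Scan a video, letting the policy choose how far to jump next."""
    spotted, i = [], 0
    while i < len(frames):
        score = classifier(frames[i])   # cheap per-frame action score
        action = policy(score)          # "emit" or an integer skip length
        if action == "emit":
            spotted.append(i)           # frame i is part of an action
            i += 1
        else:
            i += action                 # skim irrelevant video faster
    return spotted

# Toy demo: emit when the score exceeds 0.5, otherwise skip 2 frames.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.95, 0.1]
print(browse(scores, policy=lambda s: "emit" if s > 0.5 else 2,
             classifier=lambda f: f))  # -> [2, 3, 6]
```
In a real spotter the policy would be trained (e.g., with policy gradients) and conditioned on history rather than a single score; the THUMOS14 numbers quoted above indicate that such skipping can coexist with higher spotting mAP.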