Information Elevation Network for Fast Online Action Detection
- URL: http://arxiv.org/abs/2109.13572v1
- Date: Tue, 28 Sep 2021 09:02:15 GMT
- Title: Information Elevation Network for Fast Online Action Detection
- Authors: Sunah Min and Jinyoung Moon
- Abstract summary: Online action detection (OAD) is a task that receives video segments within a streaming video as inputs and identifies ongoing actions within them.
We introduce a novel information elevation unit (IEU) that lifts up and accumulates the past information relevant to the current action.
We design an efficient and effective OAD network using IEUs, called an information elevation network (IEN).
- Score: 4.203274985072923
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Online action detection (OAD) is a task that receives video segments within a
streaming video as inputs and identifies ongoing actions within them. It is
important to retain past information associated with the current action. However,
long short-term memory (LSTM), a popular recurrent unit for modeling temporal
information from videos, accumulates past information from the previous hidden
and cell states and the extracted visual features at each timestep without
considering the relationships between the past and current information.
Consequently, the forget gate of the original LSTM can lose the accumulated
information relevant to the current action because it determines which
information to forget without considering the current action. We introduce a
novel information elevation unit (IEU) that lifts up and accumulates the past
information that is especially relevant to the current action. Through ablation
studies, we design an efficient and effective OAD network using IEUs, called an
information elevation network (IEN). To the best of our knowledge, IEN is the
first OAD approach to consider the computational overhead required for its
practical use. Our IEN uses visual features extracted by a fast action
recognition network taking only RGB frames because extracting optical flows
requires heavy computation overhead. On two OAD benchmark datasets, THUMOS-14
and TVSeries, our IEN outperforms state-of-the-art OAD methods using only RGB
frames. Furthermore, on the THUMOS-14 dataset, our IEN outperforms the
state-of-the-art OAD methods using two-stream features based on RGB frames and
optical flows.
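The IEU equations are not reproduced in this abstract, so the following is only a rough illustration of the "lift up and accumulate relevant past information" idea: a minimal PyTorch sketch of an LSTM-style cell whose forget pathway additionally re-weights the past cell state by its relevance to the current input. The RelevanceGatedCell name, the relevance gate, and the feature dimensions are hypothetical assumptions and do not reproduce the paper's published IEU.

```python
import torch
import torch.nn as nn

class RelevanceGatedCell(nn.Module):
    """Illustrative LSTM-style cell (hypothetical, not the published IEU).

    A standard LSTM forgets via sigmoid(f) * c, where the forget gate f is
    computed from the previous hidden state and the input without explicitly
    scoring the accumulated state against the current action. Here, an extra
    relevance gate re-weights the past cell state by its relevance to the
    current input before it is mixed into the new state.
    """

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # Standard LSTM gates: input, forget, candidate, output.
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
        # Hypothetical relevance gate: scores the past cell state against
        # the current visual feature.
        self.relevance = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
        # Re-weight accumulated information by relevance to the current input.
        r = torch.sigmoid(self.relevance(torch.cat([x, c], dim=-1)))
        c_new = torch.sigmoid(f) * (r * c) + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, (h_new, c_new)

# Example: per-timestep RGB features from a fast recognition backbone
# (the 2048-d feature size is an assumption, e.g. a ResNet-style feature).
cell = RelevanceGatedCell(input_dim=2048, hidden_dim=512)
x = torch.randn(8, 2048)            # batch of current-segment features
h = c = torch.zeros(8, 512)
out, (h, c) = cell(x, (h, c))       # roll forward one timestep
```

In a streaming setting such a cell is applied once per incoming segment, so per-step cost stays constant; this matches the abstract's emphasis on low computational overhead, but it is only a design sketch.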
Related papers
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z) - On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on 3/4 benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD.
In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
arXiv Detail & Related papers (2022-08-01T15:56:19Z) - Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z) - Learning to Discriminate Information for Online Action Detection: Analysis and Application [32.4410197207228]
We propose a novel recurrent unit, named Information Discrimination Unit (IDU), which explicitly discriminates the information relevancy between an ongoing action and others.
We also present a new recurrent unit, called Information Integration Unit (IIU), for action anticipation.
Our IIU exploits the outputs from IDU as pseudo action labels as well as RGB frames to learn enriched features of observed actions effectively.
arXiv Detail & Related papers (2021-09-08T01:51:51Z) - AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition [68.70214388982545]
Temporal modelling is the key for efficient video action recognition.
We introduce an adaptive temporal fusion network, called AdaFuse, that fuses channels from current and past feature maps.
Our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-10T23:31:02Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method outperforms state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)