Selective Feature Compression for Efficient Activity Recognition Inference
- URL: http://arxiv.org/abs/2104.00179v1
- Date: Thu, 1 Apr 2021 00:54:51 GMT
- Title: Selective Feature Compression for Efficient Activity Recognition Inference
- Authors: Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe
- Abstract summary: Selective Feature Compression (SFC) is an action recognition inference strategy that greatly increases model inference efficiency without any accuracy compromise.
Our experiments on Kinetics-400, UCF101 and ActivityNet show that SFC speeds up inference by 6-7x and reduces memory usage by 5-6x compared with the commonly used 30-crop dense sampling procedure.
- Score: 26.43512549990624
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Most action recognition solutions rely on dense sampling to precisely cover
the informative temporal clip. Extensively searching the temporal region is
expensive for a real-world application. In this work, we focus on improving the
inference efficiency of current action recognition backbones on trimmed videos,
and illustrate that one action model can also cover the informative region by
dropping non-informative features. We present Selective Feature Compression
(SFC), an action recognition inference strategy that greatly increases model
inference efficiency without any accuracy compromise. Unlike previous
works that compress kernel sizes and decrease the channel dimension, we propose
to compress the feature flow in the spatio-temporal dimension without changing any
backbone parameters. Our experiments on Kinetics-400, UCF101 and ActivityNet
show that SFC speeds up inference by 6-7x and reduces memory usage by
5-6x compared with the commonly used 30-crop dense sampling procedure, while
also slightly improving Top-1 accuracy. We evaluate SFC and all its components
thoroughly, both quantitatively and qualitatively, and show how SFC learns
to attend to important video regions and to drop temporal features that are
uninformative for the task of action recognition.
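To make the idea concrete, here is a minimal PyTorch sketch of the kind of selective compression the abstract describes: score the temporal slices of an intermediate feature map and keep only the top fraction, leaving all backbone weights untouched. The scorer design and the keep ratio are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class SelectiveFeatureCompression(nn.Module):
    """Toy sketch: score temporal feature slices, keep the top fraction.

    Assumes features of shape (B, C, T, H, W) from a 3D backbone stem;
    the scorer and keep_ratio are illustrative, not the paper's design.
    """

    def __init__(self, channels: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight per-time-step importance scorer (assumption).
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep T, pool H and W
        self.proj = nn.Linear(channels, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = feats.shape
        pooled = self.pool(feats).view(b, c, t).transpose(1, 2)  # (B, T, C)
        scores = self.proj(pooled).squeeze(-1)                   # (B, T)
        k = max(1, int(t * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep temporal order
        # Gather the selected temporal slices without touching backbone weights.
        idx = idx[:, None, :, None, None].expand(b, c, k, h, w)
        return feats.gather(2, idx)                              # (B, C, k, H, W)
```

Because the later backbone stages then see a shorter temporal axis, compute and memory would shrink roughly in proportion to the keep ratio.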
Related papers
- Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features between two sparsely sampled, adjacent video frames (a toy sketch follows this entry).
With our method integrated, the efficiency of three commonly used baselines improves by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z)
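A minimal sketch of the restoration idea above, assuming the features of skipped frames can be approximated from their two sampled neighbours; the blend-then-refine module is a hypothetical stand-in for the paper's architecture.

```python
import torch
import torch.nn as nn

class FeatureRestorer(nn.Module):
    """Toy restorer: predict features of skipped frames from two anchors.

    The blend-then-refine design is an illustrative assumption, not the
    paper's exact architecture.
    """

    def __init__(self, channels: int, num_missing: int):
        super().__init__()
        self.num_missing = num_missing
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (B, C, H, W) features of two adjacent sampled frames.
        restored = []
        for i in range(1, self.num_missing + 1):
            alpha = i / (self.num_missing + 1)
            blend = (1 - alpha) * feat_a + alpha * feat_b  # linear interpolation
            restored.append(self.refine(blend))            # cheap learned correction
        return restored  # list of (B, C, H, W), one per skipped frame
```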
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Exploring Long- and Short-Range Temporal Information for Learned Video Compression [54.91301930491466]
We focus on exploiting the unique characteristics of video content and exploring temporal information to enhance compression performance.
For long-range temporal information exploitation, we propose a temporal prior that is continuously updated within the group of pictures (GOP) during inference.
In this way, the temporal prior accumulates valuable temporal information from all decoded frames within the current GOP (a toy update rule is sketched below).
In detail, we design a hierarchical structure to achieve multi-scale compensation.
arXiv Detail & Related papers (2022-08-07T15:57:18Z)
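One way to read the continuously updated temporal prior is as a recurrent state carried across the decoded frames of a GOP. The single-convolution update below is an assumption for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class TemporalPrior(nn.Module):
    """Toy recurrent prior updated after every decoded frame in a GOP.

    The single-conv update rule is an illustrative assumption.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.update = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, prior: torch.Tensor, decoded_feat: torch.Tensor) -> torch.Tensor:
        # Fuse the running prior with features of the newest decoded frame,
        # so the prior accumulates information from all frames seen so far.
        return torch.tanh(self.update(torch.cat([prior, decoded_feat], dim=1)))
```

Presumably the prior would be reset at each GOP boundary so random access is preserved.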
- Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition [25.888314212797436]
We propose a novel video frame sampler for few-shot action recognition.
Task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA); a toy version of this two-stage sampling is sketched below.
Experiments show a significant boost on various benchmarks including long-term videos.
arXiv Detail & Related papers (2022-07-20T09:04:12Z)
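A toy version of the two-stage sampler: a temporal selector keeps the k highest-scoring frames, and a spatial amplifier zooms into the saliency peak of each kept frame. The score and saliency inputs are assumed to come from some lightweight network; all shapes and the zoom factor are illustrative.

```python
import torch
import torch.nn.functional as F

def select_and_amplify(frames, frame_scores, saliency, k=8, zoom=1.5):
    """Toy temporal-selector / spatial-amplifier pipeline (assumptions).

    frames:       (T, C, H, W) candidate frames
    frame_scores: (T,) relevance score per frame (from some lightweight net)
    saliency:     (T, H, W) spatial saliency maps
    """
    # Temporal selector: keep the k highest-scoring frames, in temporal order.
    idx = frame_scores.topk(k).indices.sort().values
    picked = frames[idx]                      # (k, C, H, W)
    sal = saliency[idx]                       # (k, H, W)

    amplified = []
    for frame, s in zip(picked, sal):
        # Spatial amplifier: crop around the saliency peak and upsample.
        h, w = s.shape
        y, x = divmod(int(s.argmax()), w)
        ch, cw = int(h / zoom), int(w / zoom)
        y0 = min(max(y - ch // 2, 0), h - ch)
        x0 = min(max(x - cw // 2, 0), w - cw)
        crop = frame[:, y0:y0 + ch, x0:x0 + cw]
        amplified.append(F.interpolate(crop[None], size=(h, w),
                                       mode="bilinear", align_corners=False)[0])
    return torch.stack(amplified)             # (k, C, H, W)
```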
- Efficient Human Vision Inspired Action Recognition using Adaptive Spatiotemporal Sampling [13.427887784558168]
We introduce a novel adaptive vision system for efficient action recognition processing.
Our system pre-scans the global context at low resolution and then decides to skip a frame or to request high-resolution features at salient regions for further processing (see the sketch after this entry).
We validate the system on the EPIC-KITCHENS and UCF-101 datasets for action recognition, and show that our proposed approach can greatly speed up inference with a tolerable loss of accuracy compared with state-of-the-art baselines.
arXiv Detail & Related papers (2022-07-12T01:18:58Z)
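A hedged sketch of the pre-scan-then-decide loop: glance at each frame at low resolution, and pay for high-resolution features only when the glance looks salient. `policy_net`, `high_res_net` and the threshold are hypothetical placeholders, not the paper's components.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def adaptive_scan(frames, policy_net, high_res_net, threshold=0.5):
    """Toy two-stage pipeline: cheap low-res pre-scan, then selective
    high-res processing. The nets and threshold are assumptions.

    frames: (T, C, H, W) full-resolution clip.
    """
    feats = []
    for frame in frames:
        # Stage 1: cheap glance at a 4x-downsampled frame.
        low = F.interpolate(frame[None], scale_factor=0.25,
                            mode="bilinear", align_corners=False)
        saliency = policy_net(low)            # assumed scalar in [0, 1]
        if saliency.item() < threshold:
            continue                          # skip uninformative frame
        # Stage 2: pay full price only for salient frames.
        feats.append(high_res_net(frame[None]))
    return torch.cat(feats) if feats else None
```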
- Learning from Temporal Gradient for Semi-supervised Action Recognition [15.45239134477737]
We introduce temporal gradient as an additional modality for more attentive feature extraction (a minimal sketch follows this entry).
Our method achieves state-of-the-art performance on three video action recognition benchmarks.
arXiv Detail & Related papers (2021-11-25T20:30:30Z)
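The temporal-gradient modality is commonly computed as a frame-to-frame difference; a minimal version is below, noting that the paper's exact formulation may differ.

```python
import torch

def temporal_gradient(clip: torch.Tensor) -> torch.Tensor:
    """Temporal gradient as a frame-to-frame difference (a common reading
    of the modality; the paper's exact formulation may differ).

    clip: (T, C, H, W) RGB clip -> (T - 1, C, H, W) gradient clip.
    """
    return clip[1:] - clip[:-1]
```

The resulting stream can then be fed to a second pathway alongside RGB, which is what makes it usable as an additional modality.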
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition [68.70214388982545]
Temporal modelling is the key to efficient video action recognition.
We introduce an adaptive temporal fusion network, called AdaFuse, that fuses channels from current and past feature maps (a simplified gate is sketched below).
Our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-10T23:31:02Z)
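AdaFuse learns per-channel decisions about reusing history; the soft gate below is a simplified, differentiable stand-in for its discrete policy, with all module sizes assumed.

```python
import torch
import torch.nn as nn

class AdaptiveTemporalFusion(nn.Module):
    """Toy channel-wise fusion of current and previous feature maps.

    A simplified stand-in for AdaFuse's discrete per-channel policy.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # (B, C, 1, 1) global context
            nn.Flatten(),              # (B, C)
            nn.Linear(channels, channels),
            nn.Sigmoid(),              # per-channel gate in (0, 1)
        )

    def forward(self, curr: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # g -> 1 keeps the current channel, g -> 0 reuses history.
        g = self.gate(curr)[:, :, None, None]  # (B, C, 1, 1)
        return g * curr + (1 - g) * prev
```

In the real method the computation saving comes from not recomputing the channels taken from the past feature map; a soft gate like this one only illustrates the fusion rule.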
- Finding Action Tubes with a Sparse-to-Dense Framework [62.60742627484788]
We propose a framework that generates action tube proposals from video streams with a single forward pass in a sparse-to-dense manner.
We evaluate the efficacy of our model on the UCF101-24, JHMDB-21 and UCFSports benchmark datasets.
arXiv Detail & Related papers (2020-08-30T15:38:44Z)