Efficient Human Vision Inspired Action Recognition using Adaptive
Spatiotemporal Sampling
- URL: http://arxiv.org/abs/2207.05249v2
- Date: Wed, 13 Jul 2022 15:13:34 GMT
- Title: Efficient Human Vision Inspired Action Recognition using Adaptive
Spatiotemporal Sampling
- Authors: Khoi-Nguyen C. Mac, Minh N. Do, Minh P. Vo
- Abstract summary: We introduce a novel adaptive spatiotemporal sampling scheme for efficient action recognition.
Our system pre-scans the global scene context at low resolution and decides to skip or request high-resolution features at salient regions for further processing.
We validate the system on the EPIC-KITCHENS and UCF-101 datasets for action recognition, and show that our proposed approach can greatly speed up inference with a tolerable loss of accuracy compared with state-of-the-art baselines.
- Score: 13.427887784558168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adaptive sampling that exploits the spatiotemporal redundancy in videos is
critical for always-on action recognition on wearable devices with limited
computing and battery resources. The commonly used fixed sampling strategy is
not context-aware and may under-sample the visual content, and thus adversely
impacts both computation efficiency and accuracy. Inspired by the concepts of
foveal vision and pre-attentive processing from the human visual perception
mechanism, we introduce a novel adaptive spatiotemporal sampling scheme for
efficient action recognition. Our system pre-scans the global scene context at
low-resolution and decides to skip or request high-resolution features at
salient regions for further processing. We validate the system on EPIC-KITCHENS
and UCF-101 datasets for action recognition, and show that our proposed
approach can greatly speed up inference with a tolerable loss of accuracy
compared with those from state-of-the-art baselines.
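The two-stage skip-or-request decision described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, patch sizes, and thresholds are hypothetical, and per-patch intensity variance stands in for whatever learned saliency measure the paper uses.

```python
import numpy as np

def patch_saliency(frame, patch=8):
    """Score each non-overlapping patch by its intensity variance
    (a cheap stand-in for a learned pre-attentive saliency head)."""
    h, w = frame.shape
    gh, gw = h // patch, w // patch
    patches = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return patches.var(axis=(1, 3))

def adaptive_sample(frame_high, downscale=4, patch=8, threshold=50.0):
    """Pre-scan a downscaled copy of the frame, then request full-resolution
    patches only where the saliency gate fires; everywhere else the cheap
    low-resolution features are considered sufficient and are skipped."""
    low = frame_high[::downscale, ::downscale]        # global low-res pre-scan
    salient = patch_saliency(low, patch) > threshold  # boolean gate per patch
    hp = patch * downscale                            # patch size at full resolution
    high_patches = {
        (i, j): frame_high[i * hp:(i + 1) * hp, j * hp:(j + 1) * hp]
        for i in range(salient.shape[0])
        for j in range(salient.shape[1])
        if salient[i, j]                              # "request" only salient regions
    }
    return salient, high_patches
```

For a frame that is textured only in one corner, the gate fires on that corner alone, so only a small fraction of the frame is ever processed at high resolution.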
Related papers
- Perceptual Piercing: Human Visual Cue-based Object Detection in Low Visibility Conditions [2.0409124291940826]
This study proposes a novel deep learning framework inspired by atmospheric scattering and human visual cortex mechanisms to enhance object detection under poor visibility scenarios such as fog, smoke, and haze.
The objective is to enhance the precision and reliability of detection systems under adverse environmental conditions.
arXiv Detail & Related papers (2024-10-02T04:03:07Z)
- VHS: High-Resolution Iterative Stereo Matching with Visual Hull Priors [3.523208537466128]
We present a stereo-matching method for depth estimation from high-resolution images using visual hulls as priors.
Our method uses object masks extracted from supplementary views of the scene to guide the disparity estimation, effectively reducing the search space for matches.
This approach is specifically tailored to stereo rigs in volumetric capture systems, where an accurate depth plays a key role in the downstream reconstruction task.
arXiv Detail & Related papers (2024-06-04T17:59:57Z)
- Sample Less, Learn More: Efficient Action Recognition via Frame Feature Restoration [59.6021678234829]
We propose a novel method to restore the intermediate features for two sparsely sampled and adjacent video frames.
With the integration of our method, the efficiency of three commonly used baselines has been improved by over 50%, with a mere 0.5% reduction in recognition accuracy.
arXiv Detail & Related papers (2023-07-27T13:52:42Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- Accurate and Real-time Pseudo Lidar Detection: Is Stereo Neural Network Really Necessary? [6.8067583993953775]
We develop a system with a less powerful stereo matching predictor and adopt the proposed refinement schemes to improve the accuracy.
The presented system achieves accuracy competitive with state-of-the-art approaches in only 23 ms of computation, showing it is a suitable candidate for deployment in real car-hold applications.
arXiv Detail & Related papers (2022-06-28T09:53:00Z)
- Scalable Vehicle Re-Identification via Self-Supervision [66.2562538902156]
Vehicle Re-Identification is one of the key elements in city-scale vehicle analytics systems.
Many state-of-the-art solutions for vehicle re-id mostly focus on improving the accuracy on existing re-id benchmarks and often ignore computational complexity.
We propose a simple yet effective hybrid solution empowered by self-supervised training which only uses a single network during inference time.
arXiv Detail & Related papers (2022-05-16T12:14:42Z)
- Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
- Object-based Illumination Estimation with Rendering-aware Neural Networks [56.01734918693844]
We present a scheme for fast environment light estimation from the RGBD appearance of individual objects and their local image areas.
With the estimated lighting, virtual objects can be rendered in AR scenarios with shading that is consistent to the real scene.
arXiv Detail & Related papers (2020-08-06T08:23:19Z)
- AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [70.62587948892633]
Action recognition is an open and challenging problem in computer vision.
We propose a novel approach, called AR-Net, that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition.
arXiv Detail & Related papers (2020-07-31T01:36:04Z)
- Monocular Real-Time Volumetric Performance Capture [28.481131687883256]
We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video.
Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu).
We also introduce an Online Hard Example Mining (OHEM) technique that effectively suppresses failure modes due to the rare occurrence of challenging examples.
arXiv Detail & Related papers (2020-07-28T04:45:13Z)
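Among the papers above, AR-Net is the closest in spirit to the adaptive-sampling idea: it picks a processing resolution per frame rather than per region. The gist can be sketched as follows, where a simple frame-difference heuristic stands in for AR-Net's learned policy network; the resolution ladder and thresholds below are hypothetical values chosen for illustration.

```python
import numpy as np

def select_resolutions(frames, resolutions=(224, 112, 56),
                       motion_thresholds=(20.0, 5.0)):
    """Pick a processing resolution for each frame based on the mean
    absolute difference to the previous frame (a cheap stand-in for a
    learned per-frame policy): big changes get full resolution, small
    changes get progressively cheaper resolutions."""
    chosen = [resolutions[0]]  # no previous frame: process at full resolution
    for prev, cur in zip(frames, frames[1:]):
        motion = np.abs(cur.astype(float) - prev.astype(float)).mean()
        if motion >= motion_thresholds[0]:
            chosen.append(resolutions[0])   # large change: full resolution
        elif motion >= motion_thresholds[1]:
            chosen.append(resolutions[1])   # moderate change: medium
        else:
            chosen.append(resolutions[2])   # near-static: cheapest
    return chosen
```

On a near-static clip most frames fall through to the cheapest resolution, which is where the efficiency gain of this family of methods comes from.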
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.