PMI Sampler: Patch Similarity Guided Frame Selection for Aerial Action
Recognition
- URL: http://arxiv.org/abs/2304.06866v2
- Date: Wed, 15 Nov 2023 23:35:32 GMT
- Title: PMI Sampler: Patch Similarity Guided Frame Selection for Aerial Action
Recognition
- Authors: Ruiqi Xian, Xijun Wang, Divya Kothandaraman, Dinesh Manocha
- Abstract summary: We introduce the concept of patch mutual information (PMI) score to quantify the motion bias between adjacent frames.
We present an adaptive frame selection strategy using a shifted leaky ReLU and a cumulative distribution function.
Our method achieves a relative improvement of 2.2% to 13.8% in top-1 accuracy on UAV-Human, 6.8% on NEC Drone, and 9.0% on Diving48.
- Score: 52.78234467516168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new algorithm for selection of informative frames in video
action recognition. Our approach is designed for aerial videos captured with a
moving camera, where human actors occupy only a small region of each video
frame. Our algorithm utilizes the motion bias within aerial videos, which
enables the selection of motion-salient frames. We introduce the concept of
patch mutual information (PMI) score to quantify the motion bias between
adjacent frames, by measuring the similarity of patches. We use this score to
assess the amount of discriminative motion information contained in one frame
relative to another. We present an adaptive frame selection strategy using a
shifted leaky ReLU and a cumulative distribution function, which ensures that the
sampled frames comprehensively cover all the essential segments with high
motion salience. Our approach can be integrated with any action recognition
model to enhance its accuracy. In practice, our method achieves a relative
improvement of 2.2% to 13.8% in top-1 accuracy on UAV-Human, 6.8% on NEC Drone,
and 9.0% on Diving48.
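The abstract names the ingredients (a patch similarity score between adjacent frames, a shifted leaky ReLU, and a cumulative distribution function) but gives no formulas, so the following is only a rough sketch of how such a pipeline could fit together, not the authors' implementation. Everything concrete here is an assumption made for illustration: the histogram estimate of mutual information, treating low similarity as high motion, shifting the leaky ReLU at the mean score, the slope of 0.1, sampling at evenly spaced quantiles of the CDF, and the hypothetical names `patch_mi`, `patches`, `motion_scores`, and `select_frames`. Frames are plain lists of grayscale rows so that the sketch needs only the standard library.

```python
import math
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def patch_mi(a, b, bins=8, levels=256):
    """Histogram estimate of mutual information (in nats) between two pixel lists."""
    qa = [v * bins // levels for v in a]  # quantize intensities into `bins` buckets
    qb = [v * bins // levels for v in b]
    n = len(qa)
    joint = Counter(zip(qa, qb))
    pa, pb = Counter(qa), Counter(qb)
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(c / n * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in joint.items())


def patches(frame, size):
    """Split a 2D frame into flattened, non-overlapping size x size patches."""
    h, w = len(frame), len(frame[0])
    return [
        [frame[r][c] for r in range(top, top + size) for c in range(left, left + size)]
        for top in range(0, h - size + 1, size)
        for left in range(0, w - size + 1, size)
    ]


def motion_scores(frames, size=16):
    """Per-frame motion score (assumption: low patch similarity means high motion).

    Expects at least two frames, each at least size x size pixels.
    """
    sims = []
    for prev, cur in zip(frames, frames[1:]):
        pairs = list(zip(patches(prev, size), patches(cur, size)))
        sims.append(sum(patch_mi(a, b) for a, b in pairs) / len(pairs))
    top = max(sims)
    motion = [top - s for s in sims]
    # Frame 0 has no predecessor, so it reuses the score of the first transition.
    return [motion[0]] + motion


def select_frames(scores, k, slope=0.1, eps=1e-6):
    """Pick k frame indices at evenly spaced quantiles of the score CDF."""
    mean = sum(scores) / len(scores)
    # Leaky ReLU shifted to the mean: below-average frames are damped, not zeroed.
    act = [x - mean if x >= mean else slope * (x - mean) for x in scores]
    floor = min(act)
    weights = [a - floor + eps for a in act]  # strictly positive weights
    total = sum(weights)
    cdf = [c / total for c in accumulate(weights)]
    last = len(scores) - 1
    # Invert the CDF at the quantiles (j + 0.5) / k; indices may repeat if k is large.
    return [min(bisect_left(cdf, (j + 0.5) / k), last) for j in range(k)]
```

For example, under these assumptions `select_frames(motion_scores(frames), k=8)` would return eight frame indices that cluster in the segments where adjacent frames differ most, and those frames could then be passed to any recognition backbone.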
Related papers
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must handle high across-frame variation in object appearance and diverse degradation in some frames.
Most contemporary aggregation methods are tailored to two-stage detectors and suffer from high computational costs.
This study proposes a very simple yet potent strategy of feature selection and aggregation, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
- MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition [59.905048445296906]
We present a novel approach for action recognition in UAV videos.
We use the concept of mutual information to compute and align the regions corresponding to human action or motion in the temporal domain.
In practice, we achieve 18.9% improvement in Top-1 accuracy over current state-of-the-art methods.
arXiv Detail & Related papers (2023-03-05T04:05:17Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition is better served by processing the whole sequence at once than by picking frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Recurrent Video Deblurring with Blur-Invariant Motion Estimation and Pixel Volumes [14.384467317051831]
We propose two novel approaches to deblurring videos by effectively aggregating information from multiple video frames.
First, we present blur-invariant motion estimation learning to improve motion estimation accuracy between blurry frames.
Second, for motion compensation, instead of aligning frames by warping with estimated motions, we use a pixel volume that contains candidate sharp pixels to resolve motion estimation errors.
arXiv Detail & Related papers (2021-08-23T07:36:49Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but they still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring [92.40655035360729]
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions.
We propose a novel implicit method to learn spatial correspondence among blurry frames in the feature space.
Our proposed method is evaluated on the widely-adopted DVD dataset, along with a newly collected High-Frame-Rate (1000 fps) dataset for Video Deblurring.
arXiv Detail & Related papers (2021-03-07T04:33:13Z)