Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2207.09759v1
- Date: Wed, 20 Jul 2022 09:04:12 GMT
- Title: Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition
- Authors: Huabin Liu, Weixian Lv, John See, Weiyao Lin
- Abstract summary: We propose a novel video frame sampler for few-shot action recognition.
Task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA)
Experiments show a significant performance boost on various benchmarks, including long-term videos.
- Score: 25.888314212797436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A primary challenge faced in few-shot action recognition is inadequate video
data for training. To address this issue, current methods in this field mainly
focus on devising algorithms at the feature level while little attention is
paid to processing input video data. Moreover, existing frame sampling
strategies may omit critical action information in temporal and spatial
dimensions, which further impacts video utilization efficiency. In this paper,
we propose a novel video frame sampler for few-shot action recognition to
address this issue, where task-specific spatial-temporal frame sampling is
achieved via a temporal selector (TS) and a spatial amplifier (SA).
Specifically, our sampler first scans the whole video at a small computational
cost to obtain a global perception of video frames. The TS then selects the
top-T frames that contribute most significantly, and the SA subsequently
emphasizes the discriminative information of each frame by amplifying
critical regions under the guidance of saliency maps. We further adopt
task-adaptive learning to dynamically adjust the sampling strategy according to
the episode task at hand. Both the implementations of TS and SA are
differentiable for end-to-end optimization, facilitating seamless integration
of our proposed sampler with most few-shot action recognition methods.
Extensive experiments show a significant boost in performance on various
benchmarks, including long-term videos.
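The abstract is concrete enough to sketch the two components in code. The snippet below is a minimal PyTorch illustration, assuming a straight-through top-T mask for the temporal selector and a saliency-centred zoom (an affine sampling grid) for the spatial amplifier; the class names, the relaxation, and the fixed zoom factor are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of task-adaptive spatial-temporal sampling; all design details
# below are assumptions made for illustration, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSelector(nn.Module):
    """Scores every frame cheaply and keeps a (soft) top-T subset."""

    def __init__(self, feat_dim: int, top_t: int, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # lightweight per-frame scorer
        self.top_t = top_t
        self.tau = tau

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, D) features from a cheap global scan of N frames
        scores = self.scorer(frame_feats).squeeze(-1)           # (B, N)
        soft = torch.sigmoid(scores / self.tau)                 # soft keep-probabilities
        top = torch.topk(scores, self.top_t, dim=-1).indices    # hard top-T indices
        hard = torch.zeros_like(soft).scatter(-1, top, 1.0)
        # Straight-through trick: hard mask in the forward pass,
        # soft gradients in the backward pass.
        mask = hard + soft - soft.detach()                      # (B, N)
        return frame_feats * mask.unsqueeze(-1)                 # non-selected frames zeroed


class SpatialAmplifier(nn.Module):
    """Re-samples each frame around its saliency centre of mass (a simple zoom)."""

    def __init__(self, zoom: float = 0.7):
        super().__init__()
        self.zoom = zoom  # < 1.0 means the sampling grid covers a sub-region

    def forward(self, frames: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # frames: (B, C, H, W); saliency: (B, H, W), non-negative importance map
        B, _, H, W = frames.shape
        w = saliency.flatten(1)
        w = w / (w.sum(dim=1, keepdim=True) + 1e-6)
        ys = torch.linspace(-1.0, 1.0, H, device=frames.device)
        xs = torch.linspace(-1.0, 1.0, W, device=frames.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        cy = (w * gy.flatten()).sum(dim=1)       # saliency centre (normalised y)
        cx = (w * gx.flatten()).sum(dim=1)       # saliency centre (normalised x)
        theta = torch.zeros(B, 2, 3, device=frames.device)
        theta[:, 0, 0] = self.zoom               # zoomed-in sampling window ...
        theta[:, 1, 1] = self.zoom
        theta[:, 0, 2] = cx                      # ... shifted onto the salient region
        theta[:, 1, 2] = cy
        grid = F.affine_grid(theta, frames.shape, align_corners=False)
        return F.grid_sample(frames, grid, align_corners=False)
```

Both operations remain differentiable, which mirrors the property the abstract highlights for end-to-end optimization with downstream few-shot recognition methods.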
Related papers
- HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions [59.71751978599567]
This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process.
We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 hours of video.
arXiv Detail & Related papers (2024-09-16T18:15:38Z) - Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must cope with high across-frame variation in object appearance and with the diverse deterioration present in some frames.
Most contemporary aggregation methods are tailored to two-stage detectors and suffer from high computational costs.
This study presents a simple yet potent feature selection and aggregation strategy that gains significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
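As a rough illustration of the data-level idea (ranking frames by CLIP score rather than sampling uniformly), the sketch below scores frames against a text query and keeps the top-k; the use of the openai/CLIP package, the ViT-B/32 variant, and the function name are assumptions for illustration, not VaQuitA's actual pipeline.

```python
# Hedged sketch: CLIP-score-guided frame selection instead of uniform sampling.
# The openai/CLIP package and all names here are illustrative assumptions.
import clip
import torch
from PIL import Image


@torch.no_grad()
def select_frames_by_clip_score(frame_paths, query_text, k=8, device="cpu"):
    """Rank frames by CLIP image-text similarity and keep the top-k in temporal order."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    text = clip.tokenize([query_text]).to(device)
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    scores = []
    for path in frame_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        scores.append((img_feat @ text_feat.T).item())

    ranked = sorted(range(len(frame_paths)), key=lambda i: scores[i], reverse=True)
    return [frame_paths[i] for i in sorted(ranked[:k])]  # restore temporal order
```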
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
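The summary does not spell out how the per-frame features are matched across videos, so the sketch below uses dynamic time warping over cosine distances as one plausible, training-free alignment step; DTW and the function name are assumptions for illustration only.

```python
# Hedged sketch: align two videos by DTW over per-frame feature distances.
import numpy as np


def dtw_align(feats_a: np.ndarray, feats_b: np.ndarray):
    """feats_a: (N, D), feats_b: (M, D) per-frame features; returns matched (i, j) pairs."""
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + 1e-8)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + 1e-8)
    cost = 1.0 - a @ b.T                                   # cosine distances (N, M)

    N, M = cost.shape
    acc = np.full((N + 1, M + 1), np.inf)                  # accumulated cost table
    acc[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])

    path, i, j = [], N, M                                  # backtrack the warping path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```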
arXiv Detail & Related papers (2023-04-13T22:20:54Z) - Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z) - OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z) - MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z) - Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
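To make the aggregation idea concrete, the sketch below pools frame-level features with a small transformer encoder and trains it with a generic InfoNCE-style contrastive loss; the layer sizes, mean pooling, and loss are stand-ins rather than TCA's exact formulation.

```python
# Hedged sketch: video-level descriptor from frame-level features + contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAggregator(nn.Module):
    """Aggregates (B, N, D) frame features into an L2-normalised (B, D) video descriptor."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 1):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        ctx = self.encoder(frame_feats)        # long-range temporal context across frames
        return F.normalize(ctx.mean(dim=1), dim=-1)


def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07):
    """Each anchor should match its own positive among all positives in the batch."""
    logits = anchor @ positive.T / temperature             # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```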
arXiv Detail & Related papers (2020-08-04T05:24:20Z) - Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in videos.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmarks, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)