Few-shot Action Recognition via Intra- and Inter-Video Information
Maximization
- URL: http://arxiv.org/abs/2305.06114v1
- Date: Wed, 10 May 2023 13:05:43 GMT
- Title: Few-shot Action Recognition via Intra- and Inter-Video Information
Maximization
- Authors: Huabin Liu, Weiyao Lin, Tieyuan Chen, Yuxi Li, Shuyuan Li, John See
- Abstract summary: We propose a novel framework, Video Information Maximization (VIM), for few-shot action recognition.
VIM is equipped with an adaptive spatial-temporal video sampler and a spatiotemporal action alignment model.
VIM acts to maximize the distinctiveness of video information from limited video data.
- Score: 28.31541961943443
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current few-shot action recognition involves two primary sources of
information for classification: (1) intra-video information, determined by frame
content within a single video clip, and (2) inter-video information, measured
by relationships (e.g., feature similarity) among videos. However, existing
methods inadequately exploit these two information sources. In terms of
intra-video information, current sampling operations for input videos may omit
critical action information, reducing the utilization efficiency of video data.
For the inter-video information, the action misalignment among videos makes it
challenging to calculate precise relationships. Moreover, how to jointly
consider both inter- and intra-video information remains under-explored for
few-shot action recognition. To this end, we propose a novel framework, Video
Information Maximization (VIM), for few-shot video action recognition. VIM is
equipped with an adaptive spatial-temporal video sampler and a spatiotemporal
action alignment model to maximize intra- and inter-video information,
respectively. The video sampler adaptively selects important frames and
amplifies critical spatial regions for each input video based on the task at
hand. This preserves and emphasizes informative parts of video clips while
eliminating interference at the data level. The alignment model performs
temporal and spatial action alignment sequentially at the feature level,
leading to more precise measurements of inter-video similarity. Finally, these
goals are facilitated by incorporating additional loss terms based on mutual
information measurement. Consequently, VIM acts to maximize the distinctiveness
of video information from limited video data. Extensive experimental results on
public datasets for few-shot action recognition demonstrate the effectiveness
and benefits of our framework.
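To make the described architecture more concrete, below is a minimal, hedged sketch (in PyTorch) of how the three ingredients from the abstract, adaptive frame selection, temporal action alignment, and a mutual-information-based loss, could fit together. All module names, tensor shapes, and the InfoNCE-style surrogate for the mutual-information terms are illustrative assumptions made from the abstract alone, not the authors' released implementation.

```python
# Minimal, self-contained sketch of a VIM-style few-shot pipeline, written from the
# abstract only. Module names, shapes, and the InfoNCE surrogate for the mutual-
# information terms are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSelector(nn.Module):
    """Scores frames and softly re-weights them (differentiable stand-in for frame selection)."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frames):                     # frames: (T, D) per-frame features
        weights = torch.softmax(self.scorer(frames).squeeze(-1), dim=0)  # (T,)
        return frames * weights.unsqueeze(-1)      # emphasize informative frames


def temporal_alignment_similarity(query, support):
    """Frame-to-frame soft alignment: each query frame is matched to its best support frame."""
    # query, support: (T, D) L2-normalized frame features
    sim = query @ support.t()                      # (T, T) cosine similarities
    return sim.max(dim=1).values.mean()            # average best-match similarity


def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE lower bound on mutual information between anchor and positive clip features."""
    pos = (anchor * positive).sum(-1, keepdim=True) / tau   # (1,)
    neg = anchor @ negatives.t() / tau                      # (N,)
    logits = torch.cat([pos, neg])                          # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))


if __name__ == "__main__":
    T, D = 8, 64
    selector = TemporalSelector(D)
    q = F.normalize(selector(torch.randn(T, D)), dim=-1)    # query video features
    s = F.normalize(selector(torch.randn(T, D)), dim=-1)    # support video (same class)
    n = F.normalize(torch.randn(4, D), dim=-1)              # clip-level negatives
    score = temporal_alignment_similarity(q, s)             # inter-video similarity
    mi_loss = info_nce(q.mean(0), s.mean(0), n)             # mutual-information surrogate
    print(f"aligned similarity {score.item():.3f}, MI loss {mi_loss.item():.3f}")
```

In this sketch the soft frame weighting stands in for the paper's sampler (which additionally amplifies spatial regions), and the best-match frame similarity stands in for the spatiotemporal alignment model; both are kept differentiable so that the contrastive term can be optimized end to end, mirroring the mutual-information-based losses described in the abstract.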
Related papers
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets [62.280729345770936]
We introduce the task of Alignable Video Retrieval (AVR).
Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query.
Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-02T20:00:49Z) - CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring [44.30048301161034]
Video deblurring aims to enhance the quality of restored results in motion-blurred videos by gathering information from adjacent video frames.
We propose two modules: 1) Intra-frame feature enhancement operates within the exposure time of a single blurred frame, and 2) Inter-frame temporal feature alignment gathers valuable long-range temporal information to target frames.
We demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods through extensive experiments conducted on both synthetic and real-world deblurring datasets.
arXiv Detail & Related papers (2024-08-27T10:09:17Z) - Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Based on the evaluation of the existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z) - Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action
Recognition [25.888314212797436]
We propose a novel video frame sampler for few-shot action recognition.
Task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA).
Experiments show a significant boost on various benchmarks including long-term videos.
arXiv Detail & Related papers (2022-07-20T09:04:12Z) - Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learn robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z) - A gaze driven fast-forward method for first-person videos [2.362412515574206]
We address the problem of accessing relevant information in First-Person Videos by creating an accelerated version of the input video that emphasizes the moments important to the recorder.
Our method is based on an attention model driven by gaze and visual scene analysis that provides a semantic score of each frame of the input video.
arXiv Detail & Related papers (2020-06-10T00:08:42Z)