Inductive and Transductive Few-Shot Video Classification via Appearance
and Temporal Alignments
- URL: http://arxiv.org/abs/2207.10785v1
- Date: Thu, 21 Jul 2022 23:28:52 GMT
- Title: Inductive and Transductive Few-Shot Video Classification via Appearance
and Temporal Alignments
- Authors: Khoi D. Nguyen, Quoc-Huy Tran, Khoi Nguyen, Binh-Son Hua, Rang Nguyen
- Abstract summary: We present a novel method for few-shot video classification, which performs appearance and temporal alignments.
Our approach achieves similar or better results than previous methods on both datasets.
- Score: 17.673345523918947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel method for few-shot video classification, which performs
appearance and temporal alignments. In particular, given a pair of query and
support videos, we conduct appearance alignment via frame-level feature
matching to achieve the appearance similarity score between the videos, while
utilizing temporal order-preserving priors for obtaining the temporal
similarity score between the videos. Moreover, we introduce a few-shot video
classification framework that leverages the above appearance and temporal
similarity scores across multiple steps, namely prototype-based training and
testing as well as inductive and transductive prototype refinement. To the best
of our knowledge, our work is the first to explore transductive few-shot video
classification. Extensive experiments on both Kinetics and Something-Something
V2 datasets show that both appearance and temporal alignments are crucial for
datasets with temporal order sensitivity such as Something-Something V2. Our
approach achieves similar or better results than previous methods on both
datasets. Our code is available at https://github.com/VinAIResearch/fsvc-ata.
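To make the two scores described in the abstract concrete, here is a minimal sketch of frame-level appearance matching combined with an order-preserving temporal prior, assuming per-frame features from some backbone encoder. The soft frame matching and the Gaussian diagonal prior below are illustrative stand-ins rather than the authors' formulation; the released code at the repository above is the authoritative implementation, and the prototype-based training/testing and inductive/transductive refinement steps are not shown here.

```python
# Hedged sketch of appearance and temporal similarity scores between two videos.
import torch
import torch.nn.functional as F

def frame_similarity_matrix(query, support):
    """Cosine similarity between every query frame and every support frame.
    query: (Tq, D), support: (Ts, D) frame-level features."""
    q = F.normalize(query, dim=-1)
    s = F.normalize(support, dim=-1)
    return q @ s.t()                                   # (Tq, Ts)

def appearance_similarity(query, support):
    """Appearance score: each query frame matches its closest support frame,
    ignoring temporal order entirely."""
    sim = frame_similarity_matrix(query, support)
    return sim.max(dim=1).values.mean()

def temporal_similarity(query, support, bandwidth=0.2):
    """Temporal score: the same frame similarities re-weighted by an
    order-preserving prior that favours matches near the normalised diagonal."""
    sim = frame_similarity_matrix(query, support)
    tq, ts = sim.shape
    pos_q = torch.linspace(0, 1, tq).unsqueeze(1)       # (Tq, 1)
    pos_s = torch.linspace(0, 1, ts).unsqueeze(0)       # (1, Ts)
    prior = torch.exp(-((pos_q - pos_s) ** 2) / (2 * bandwidth ** 2))
    weights = prior / prior.sum(dim=1, keepdim=True)
    return (sim * weights).sum(dim=1).mean()

# Example: 8 query frames vs. 8 support frames with 512-d features.
q, s = torch.randn(8, 512), torch.randn(8, 512)
score = appearance_similarity(q, s) + temporal_similarity(q, s)
```

In a prototype-based episode, the support features would typically be class prototypes, and the combined score would rank candidate classes for the query video.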
Related papers
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video whose category is defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information to handle the temporally correlated nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
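The entry does not spell out how the multi-grained prototypes are formed; as a hedged illustration of the generic idea only, the sketch below pools a class prototype from masked support features and compares it to query-video features at frame level and at clip level. The masked pooling, clip averaging, and cosine affinities are assumptions, not the IPMT model.

```python
import torch
import torch.nn.functional as F

def masked_average_pool(feat, mask):
    """feat: (C, H, W) support feature map; mask: (H, W) binary object mask."""
    mask = mask.unsqueeze(0)                                  # (1, H, W)
    return (feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1.0)

def query_affinities(prototype, query_feats, clip_len=4):
    """query_feats: (T, C, H, W). Returns per-frame and per-clip cosine affinities."""
    p = F.normalize(prototype, dim=0)                         # (C,)
    q = F.normalize(query_feats, dim=1)                       # (T, C, H, W)
    frame_aff = torch.einsum('c,tchw->thw', p, q)             # fine-grained, per frame
    t = (q.shape[0] // clip_len) * clip_len
    clips = q[:t].reshape(-1, clip_len, *q.shape[1:]).mean(dim=1)
    clips = F.normalize(clips, dim=1)
    clip_aff = torch.einsum('c,tchw->thw', p, clips)          # coarser, per clip
    return frame_aff, clip_aff

support_feat = torch.randn(256, 32, 32)
support_mask = (torch.rand(32, 32) > 0.5).float()
proto = masked_average_pool(support_feat, support_mask)
frame_aff, clip_aff = query_affinities(proto, torch.randn(8, 256, 32, 32))
```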
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
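A rough, training-free sketch in the spirit of this entry: build a per-frame descriptor from off-the-shelf tools and align two videos over those descriptors. Concatenating pose keypoints with a CNN embedding and using dynamic time warping (DTW) are assumptions for illustration; the paper's exact feature construction and alignment objective may differ.

```python
import numpy as np

def frame_descriptor(pose_keypoints, cnn_embedding):
    """pose_keypoints: (K, 2) normalised joints from a pose estimator run on the
    detected person box; cnn_embedding: (D,) e.g. a VGG feature of the frame.
    Both come from pretrained models, so no training is needed."""
    return np.concatenate([pose_keypoints.reshape(-1), cnn_embedding])

def dtw_alignment_cost(seq_a, seq_b):
    """Classic DTW over per-frame descriptors: accumulated cost of the best
    order-preserving alignment between the two videos."""
    ta, tb = len(seq_a), len(seq_b)
    dist = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=-1)
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # skip in A
                                                 acc[i, j - 1],      # skip in B
                                                 acc[i - 1, j - 1])  # match
    return acc[ta, tb]

# Example with random stand-ins for the extracted features.
video_a = np.stack([frame_descriptor(np.random.rand(17, 2), np.random.rand(128)) for _ in range(20)])
video_b = np.stack([frame_descriptor(np.random.rand(17, 2), np.random.rand(128)) for _ in range(25)])
cost = dtw_alignment_cost(video_a, video_b)
```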
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification, while also offering faster inference.
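As a hedged sketch of the frame-wise (rather than clip-wise) contrastive idea: two augmented views of the same video are encoded frame by frame, and frame t in one view is pulled towards frame t in the other while being pushed away from all other frames. CARL's actual objective and encoder are more involved; this only illustrates the basic loss.

```python
import torch
import torch.nn.functional as F

def framewise_info_nce(view1, view2, temperature=0.1):
    """view1, view2: (T, D) frame embeddings of two augmentations of one video."""
    z1 = F.normalize(view1, dim=-1)
    z2 = F.normalize(view2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (T, T) frame-to-frame similarities
    targets = torch.arange(z1.shape[0])     # positive = same temporal index
    return F.cross_entropy(logits, targets)

loss = framewise_info_nce(torch.randn(16, 128), torch.randn(16, 128))
```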
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We also study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
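A hedged illustration of the general idea that a single fixed temporal kernel size limits how 3D CNNs handle temporal variation (not this paper's specific operator): a block that applies parallel 3D convolutions with different temporal extents and fuses them, so features are computed over several temporal scales at once.

```python
import torch
import torch.nn as nn

class MultiTemporalScaleBlock(nn.Module):
    def __init__(self, in_ch, out_ch, temporal_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=(t, 3, 3),
                      padding=(t // 2, 1, 1))             # keep T, H, W unchanged
            for t in temporal_sizes
        ])
        self.fuse = nn.Conv3d(out_ch * len(temporal_sizes), out_ch, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, T, H, W)
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

block = MultiTemporalScaleBlock(in_ch=3, out_ch=16)
out = block(torch.randn(2, 3, 8, 32, 32))                  # -> (2, 16, 8, 32, 32)
```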
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Temporal Alignment Prediction for Few-Shot Video Classification [17.18278071760926]
We propose Temporal Alignment Prediction (TAP) based on sequence similarity learning for few-shot video classification.
In order to obtain the similarity of a pair of videos, we predict the alignment scores between all pairs of temporal positions in the two videos.
We evaluate TAP on two video classification benchmarks including Kinetics and Something-Something V2.
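As a rough illustration of the pairwise-scoring idea summarised above: score every pair of temporal positions between two videos, then reduce the (Tq x Ts) score matrix to one video-to-video similarity. The bilinear scorer and the row-wise softmax aggregation below are assumptions; TAP's learned alignment predictor differs.

```python
import torch
import torch.nn as nn

class PairwiseTemporalScorer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Parameter(torch.eye(dim))       # learnable scoring matrix

    def forward(self, query, support):
        """query: (Tq, D), support: (Ts, D) -> scalar video-to-video similarity."""
        scores = query @ self.bilinear @ support.t()       # (Tq, Ts) pairwise scores
        align = scores.softmax(dim=1)                      # soft alignment per query frame
        return (align * scores).sum(dim=1).mean()          # expected aligned score

scorer = PairwiseTemporalScorer(dim=256)
sim = scorer(torch.randn(8, 256), torch.randn(8, 256))
```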
arXiv Detail & Related papers (2021-07-26T05:12:27Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
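A hedged sketch of this positive-consistency idea: instead of only discriminating instances, each clip is pulled towards a retrieved positive, here simply its nearest neighbour in a memory bank of clip embeddings. The retrieval criterion and ASCNet's two tasks (appearance consistency and speed consistency) are simplified away; this only shows the "agree with your positive" objective.

```python
import torch
import torch.nn.functional as F

def retrieve_positive(clip, bank):
    """clip: (D,), bank: (N, D). Return the most similar bank entry."""
    sims = F.normalize(bank, dim=-1) @ F.normalize(clip, dim=0)
    return bank[sims.argmax()]

def consistency_loss(clip, bank):
    """Pull the clip embedding towards its retrieved positive (cosine agreement)."""
    pos = retrieve_positive(clip, bank).detach()   # the positive acts as a fixed target
    return 1.0 - F.cosine_similarity(clip, pos, dim=0)

bank = torch.randn(1024, 128)                       # memory bank of clip embeddings
loss = consistency_loss(torch.randn(128), bank)
```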
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Learning Implicit Temporal Alignment for Few-shot Video Classification [40.57508426481838]
Few-shot video classification aims to learn new video categories with only a few labeled examples.
It is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting.
We propose a novel matching-based few-shot learning strategy for video sequences in this work.
arXiv Detail & Related papers (2021-05-11T07:18:57Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations to the model.
We show that our method encodes valuable information about specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
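A hedged sketch of the "augmentation aware" idea summarised above: the parameters of the augmentation relating the two views (e.g. temporal shift, crop geometry) are encoded by a small MLP and combined with the anchor embedding before the contrastive comparison, so the model is told how the views are related rather than being forced to ignore it. The parameterisation and the additive combination are assumptions, not the paper's exact encoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentationAwareHead(nn.Module):
    def __init__(self, dim, aug_dim):
        super().__init__()
        self.aug_encoder = nn.Sequential(
            nn.Linear(aug_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, anchor, positive, aug_params, temperature=0.1):
        """anchor, positive: (B, D); aug_params: (B, A) parameters of the
        augmentation mapping each anchor view to its positive view."""
        conditioned = F.normalize(anchor + self.aug_encoder(aug_params), dim=-1)
        pos = F.normalize(positive, dim=-1)
        logits = conditioned @ pos.t() / temperature
        return F.cross_entropy(logits, torch.arange(anchor.shape[0]))

head = AugmentationAwareHead(dim=128, aug_dim=6)
loss = head(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 6))
```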
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- Temporal-Relational CrossTransformers for Few-Shot Action Recognition [82.0033565755246]
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frames between the query and videos in the support set.
Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos.
A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order CrossTransformers.
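A hedged sketch of the prototype construction summarised above: the query video's frames attend over frames pooled from all support videos of a class, producing a query-conditioned prototype whose distance to the query gives the class score. The temporal-relational (frame-tuple) representation and the exact CrossTransformer parameterisation are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryConditionedPrototype(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, query_frames, support_frames):
        """query_frames: (Tq, D); support_frames: (K*Ts, D), frames from all
        support videos of one class. Returns a (Tq, D) prototype and a score."""
        q = self.to_q(query_frames)
        k = self.to_k(support_frames)
        v = self.to_v(support_frames)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)   # (Tq, K*Ts)
        prototype = attn @ v                                           # (Tq, D)
        score = -F.pairwise_distance(self.to_v(query_frames), prototype).mean()
        return prototype, score

module = QueryConditionedPrototype(dim=256)
proto, score = module(torch.randn(8, 256), torch.randn(5 * 8, 256))
```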
arXiv Detail & Related papers (2021-01-15T15:47:35Z)