HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot
Action Recognition
- URL: http://arxiv.org/abs/2301.03330v1
- Date: Mon, 9 Jan 2023 13:32:50 GMT
- Title: HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot
Action Recognition
- Authors: Xiang Wang, Shiwei Zhang, Zhiwu Qing, Zhengrong Zuo, Changxin Gao,
Rong Jin, Nong Sang
- Abstract summary: We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
- Score: 51.2715005161475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent attempts mainly focus on learning deep representations for each video
individually under the episodic meta-learning regime and then performing
temporal alignment to match query and support videos. However, they still
suffer from two drawbacks: (i) learning individual features without considering
the entire task may result in limited representation capability, and (ii)
existing alignment strategies are sensitive to noise and misaligned instances.
To handle the two limitations, we propose a novel Hybrid Relation guided
temporal Set Matching (HyRSM++) approach for few-shot action recognition. The
core idea of HyRSM++ is to integrate all videos within the task to learn
discriminative representations and involve a robust matching technique. To be
specific, HyRSM++ consists of two key components, a hybrid relation module and
a temporal set matching metric. Given the basic representations from the
feature extractor, the hybrid relation module is introduced to fully exploit
associated relations within and across videos in an episodic task and thus can
learn task-specific embeddings. Subsequently, in the temporal set matching
metric, we carry out the distance measure between query and support videos from
a set matching perspective and design a bidirectional Mean Hausdorff Metric (Bi-MHM) to improve the resilience to
misaligned instances. In addition, we explicitly exploit the temporal coherence
in videos to regularize the matching process. Furthermore, we extend the
proposed HyRSM++ to deal with the more challenging semi-supervised few-shot
action recognition and unsupervised few-shot action recognition tasks.
Experimental results on multiple benchmarks demonstrate that our method
achieves state-of-the-art performance under various few-shot settings. The
source code is available at
https://github.com/alibaba-mmai-research/HyRSMPlusPlus.
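To make the two key components more concrete, the PyTorch sketch below illustrates one possible reading of the abstract. It is a minimal illustration, not the authors' released implementation: the names HybridRelationSketch, bi_mhm_distance, and temporal_coherence_penalty, the use of multi-head attention for the within-video and cross-video relations, the plain L2 symmetric mean-of-minima as the set matching distance, and the adjacent-frame smoothness penalty are all assumptions made for this example.
```python
import torch
import torch.nn as nn


class HybridRelationSketch(nn.Module):
    """Illustrative stand-in for the hybrid relation idea: frames first attend
    to each other within a video (intra-video relations), then each pooled
    video token attends to the other videos in the episode (inter-video
    relations), so the resulting embeddings become task-specific."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, videos: torch.Tensor) -> torch.Tensor:
        # videos: (N, T, C) -- N videos in the episode, T frames, C channels.
        frames, _ = self.intra(videos, videos, videos)          # within-video relations
        tokens = frames.mean(dim=1, keepdim=True)               # (N, 1, C) video-level tokens
        episode = tokens.transpose(0, 1)                        # (1, N, C) one sequence of videos
        task_aware, _ = self.inter(episode, episode, episode)   # cross-video relations
        # Broadcast each task-aware video token back onto its frames.
        return frames + task_aware.transpose(0, 1)


def bi_mhm_distance(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Bidirectional mean-Hausdorff-style set distance between a query video
    (Tq, C) and a support video (Ts, C). Every frame is scored against its
    closest counterpart in the other set, so a few misaligned frames do not
    dominate the match."""
    d = torch.cdist(query, support)                  # (Tq, Ts) pairwise L2 distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


def temporal_coherence_penalty(frames: torch.Tensor) -> torch.Tensor:
    """Toy regularizer that nudges adjacent frame embeddings to vary smoothly,
    standing in for the temporal coherence term mentioned in the abstract."""
    return (frames[1:] - frames[:-1]).pow(2).sum(dim=-1).mean()
```
In a typical N-way episode, one would pass all support and query frame features through the relation module together, classify each query video by the support video with the smallest bi_mhm_distance, and add the coherence penalty to the training loss with a small weight.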
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition [36.426688592783975]
MVP-Shot is a framework to learn and align semantic-related action features at multi-velocity levels.
MVFA module measures similarity between features from support and query videos with different velocity scales.
PST module injects velocity-tailored text information into the video feature via feature interaction on channel and temporal domains.
arXiv Detail & Related papers (2024-05-03T13:10:16Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and
Highlight Detection [9.032057312774564]
Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks.
Several methods have been devoted to building DETR-based networks to solve both MR and HD jointly.
We propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD.
arXiv Detail & Related papers (2024-01-04T14:55:57Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Hybrid Relation Guided Set Matching for Few-shot Action Recognition [51.3308583226322]
We propose a novel Hybrid Relation guided Set Matching (HyRSM) approach that incorporates two key components.
The purpose of the hybrid relation module is to learn task-specific embeddings by fully exploiting associated relations within and across videos in an episode.
We evaluate HyRSM on six challenging benchmarks, and the experimental results show its superiority over the state-of-the-art methods by a convincing margin.
arXiv Detail & Related papers (2022-04-28T11:43:41Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)