Two-stream joint matching method based on contrastive learning for
few-shot action recognition
- URL: http://arxiv.org/abs/2401.04150v1
- Date: Mon, 8 Jan 2024 13:37:15 GMT
- Title: Two-stream joint matching method based on contrastive learning for
few-shot action recognition
- Authors: Long Deng, Ziqiang Li, Bingxin Zhou, Zhongming Chen, Ao Li and Yongxin
Ge
- Abstract summary: We propose a Two-Stream Joint Matching method based on contrastive learning (TSJM), which consists of a Multi-modal Contrastive Learning Module (MCL) and a Joint Matching Module (JMM).
The objective of the MCL is to extensively investigate the inter-modal mutual information relationships.
The JMM aims to simultaneously address the aforementioned video matching problems.
- Score: 6.657975899342652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although few-shot action recognition based on the metric learning paradigm has
achieved significant success, it fails to address the following issues: (1)
inadequate action relation modeling and underutilization of multi-modal
information; (2) challenges in matching videos of different lengths and speeds
and in handling misalignment between video sub-actions. To address these
issues, we propose a Two-Stream Joint Matching
method based on contrastive learning (TSJM), which consists of two modules:
Multi-modal Contrastive Learning Module (MCL) and Joint Matching Module (JMM).
The objective of the MCL is to extensively investigate the inter-modal mutual
information relationships, thereby thoroughly extracting modal information to
enhance the modeling of action relationships. The JMM aims to simultaneously
address the aforementioned video matching problems. The effectiveness of the
proposed method is evaluated on two widely used few-shot action recognition
datasets, namely, SSv2 and Kinetics. Comprehensive ablation experiments are
also conducted to substantiate the efficacy of our proposed approach.
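The abstract describes the MCL only at a high level: maximise inter-modal mutual information so that the two streams inform each other. A common way to realise such an objective is a symmetric InfoNCE loss over paired clip embeddings from the two modalities; the sketch below illustrates that reading. The function name, the RGB/motion feature names and the temperature value are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (PyTorch) of a symmetric cross-modal InfoNCE objective, one
# plausible reading of an MCL-style loss; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def crossmodal_infonce(rgb_feat: torch.Tensor,
                       motion_feat: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """rgb_feat, motion_feat: (B, D) clip-level embeddings of the same B videos."""
    rgb = F.normalize(rgb_feat, dim=-1)
    mot = F.normalize(motion_feat, dim=-1)
    logits = rgb @ mot.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(rgb.size(0), device=rgb.device)
    # Matching pairs (same video, different modality) lie on the diagonal and act
    # as positives; every other entry is a negative. Symmetrise over both
    # retrieval directions so neither modality dominates.
    loss_rgb_to_mot = F.cross_entropy(logits, targets)
    loss_mot_to_rgb = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_rgb_to_mot + loss_mot_to_rgb)

# Toy usage with random features for a batch of 8 videos:
if __name__ == "__main__":
    print(crossmodal_infonce(torch.randn(8, 256), torch.randn(8, 256)).item())
```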
Related papers
- Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching [10.709744162565274]
We propose a novel method called DIAS to bridge the modality gap from two aspects.
The method achieves 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.
arXiv Detail & Related papers (2024-10-22T09:37:29Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
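The HCT-DMG entry above mentions dynamic modality gating, but the one-line summary gives no mechanism. A common generic form is a learned gate that decides, per sample, how much to trust a crossmodal-attended representation versus the unimodal one; the sketch below shows that generic pattern only. The dimensions, names, and the sigmoid gate are illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch (PyTorch) of a generic learned modality gate: per-sample
# blending of a unimodal feature with a crossmodal-attended feature. This is a
# generic pattern, not HCT-DMG's architecture; dimensions are illustrative.
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, unimodal: torch.Tensor, crossmodal: torch.Tensor) -> torch.Tensor:
        """unimodal, crossmodal: (B, D). Returns a gated mixture of the two."""
        g = self.gate(torch.cat([unimodal, crossmodal], dim=-1))  # (B, D) in [0, 1]
        # When the other modality is incongruent, the gate can fall back to the
        # unimodal feature by driving g toward zero.
        return unimodal + g * (crossmodal - unimodal)

# Toy usage for a batch of 4 samples:
if __name__ == "__main__":
    gate = ModalityGate(dim=256)
    print(gate(torch.randn(4, 256), torch.randn(4, 256)).shape)
```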
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task include the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations are cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- Hybrid Relation Guided Set Matching for Few-shot Action Recognition [51.3308583226322]
We propose a novel Hybrid Relation guided Set Matching (HyRSM) approach that incorporates two key components.
The purpose of the hybrid relation module is to learn task-specific embeddings by fully exploiting associated relations within and across videos in an episode (a minimal set-matching sketch appears after this list).
We evaluate HyRSM on six challenging benchmarks, and the experimental results show its superiority over the state-of-the-art methods by a convincing margin.
arXiv Detail & Related papers (2022-04-28T11:43:41Z)
- MRI-based Multi-task Decoupling Learning for Alzheimer's Disease Detection and MMSE Score Prediction: A Multi-site Validation [9.427540028148963]
Accurately detecting Alzheimer's disease (AD) and predicting the mini-mental state examination (MMSE) score from magnetic resonance imaging (MRI) are important tasks in elderly health care.
Most of the previous methods on these two tasks are based on single-task learning and rarely consider the correlation between them.
We propose an MRI-based multi-task decoupled learning method for AD detection and MMSE score prediction.
arXiv Detail & Related papers (2022-04-02T09:19:18Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
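The HyRSM entry above and the length/speed mismatch raised in the TSJM abstract both come down to comparing videos represented as sets of frame embeddings. Below is a minimal, generic bidirectional set-matching distance that tolerates different clip lengths; it is an assumption-level illustration of that family of metrics, not the exact measure used by either paper, and the tensor names are chosen for illustration.

```python
# Minimal sketch (PyTorch) of a bidirectional set-matching distance between two
# variable-length videos; a generic illustration, not a specific paper's metric.
import torch
import torch.nn.functional as F

def set_matching_distance(query: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """query: (Tq, D) frame embeddings; support: (Ts, D). Tq and Ts may differ."""
    q = F.normalize(query, dim=-1)
    s = F.normalize(support, dim=-1)
    dist = 1.0 - q @ s.t()                  # (Tq, Ts) pairwise cosine distances
    # Match every query frame to its closest support frame and vice versa, so
    # clips of different lengths or speeds remain directly comparable.
    q_to_s = dist.min(dim=1).values.mean()
    s_to_q = dist.min(dim=0).values.mean()
    return 0.5 * (q_to_s + s_to_q)

# Toy usage: a 12-frame query clip against an 8-frame support clip.
if __name__ == "__main__":
    print(set_matching_distance(torch.randn(12, 128), torch.randn(8, 128)).item())
```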
This list is automatically generated from the titles and abstracts of the papers in this site.