M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition
- URL: http://arxiv.org/abs/2308.03063v1
- Date: Sun, 6 Aug 2023 09:15:14 GMT
- Title: M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition
- Authors: Hao Tang, Jun Liu, Shuanglin Yan, Rui Yan, Zechao Li, Jinhui Tang
- Abstract summary: M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates \textit{multi-view encoding}, \textit{multi-view matching}, and \textit{multi-view fusion} to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
- Score: 80.21796574234287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the scarcity of manually annotated data required for fine-grained
video understanding, few-shot fine-grained (FS-FG) action recognition has
gained significant attention, with the aim of classifying novel fine-grained
action categories with only a few labeled instances. Despite the progress made
in FS coarse-grained action recognition, current approaches encounter two
challenges when dealing with the fine-grained action categories: the inability
to capture subtle action details and the insufficiency of learning from limited
data that exhibit high intra-class variance and inter-class similarity. To
address these limitations, we propose M$^3$Net, a matching-based framework for
FS-FG action recognition, which incorporates \textit{multi-view encoding},
\textit{multi-view matching}, and \textit{multi-view fusion} to facilitate
embedding encoding, similarity matching, and decision making across multiple
viewpoints. \textit{Multi-view encoding} captures rich contextual details from
the intra-frame, intra-video, and intra-episode perspectives, generating
customized higher-order embeddings for fine-grained data. \textit{Multi-view
matching} integrates various matching functions enabling flexible relation
modeling within limited samples to handle multi-scale spatio-temporal
variations by leveraging the instance-specific, category-specific, and
task-specific perspectives. \textit{Multi-view fusion} consists of
matching-predictions fusion and matching-losses fusion over the above views,
where the former promotes mutual complementarity and the latter enhances
embedding generalizability by employing multi-task collaborative learning.
Explainable visualizations and experimental results on three challenging
benchmarks demonstrate the superiority of M$^3$Net in capturing fine-grained
action details and achieving state-of-the-art performance for FS-FG action
recognition.
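As a rough illustration of how a matching-based multi-view pipeline of this kind can be organized, the sketch below fuses the predictions and losses of several matching views within one few-shot episode. This is not the authors' code: the function and parameter names (`multi_view_episode`, `cosine_proto_matcher`, `pred_weights`, `loss_weights`) and the prototype-cosine matcher are illustrative assumptions layered on the abstract's description.
```python
import torch
import torch.nn.functional as F

def cosine_proto_matcher(q_emb, s_emb, s_labels, protos, tau=10.0):
    # One simple "category-specific" view (an assumed stand-in, not a paper
    # component): cosine similarity of each query embedding to the class
    # prototypes, scaled by a temperature.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(protos, dim=-1)
    return tau * q @ p.t()                        # [Q, N]

def multi_view_episode(support, support_labels, query, encoder, matchers,
                       pred_weights, loss_weights, query_labels=None):
    # Multi-view encoding is assumed to live inside `encoder`, which maps
    # support/query clips to D-dimensional embeddings.
    s_emb = encoder(support)                      # [N*K, D]
    q_emb = encoder(query)                        # [Q, D]

    # Class prototypes for prototype-style matching views.
    classes = support_labels.unique(sorted=True)
    protos = torch.stack([s_emb[support_labels == c].mean(0) for c in classes])

    # Multi-view matching: each matcher returns one [Q, N] score matrix.
    per_view = [m(q_emb, s_emb, support_labels, protos) for m in matchers]

    # Matching-predictions fusion: weighted combination of the view logits.
    fused_logits = sum(w * l for w, l in zip(pred_weights, per_view))

    # Matching-losses fusion: weighted sum of per-view losses, in the spirit
    # of multi-task collaborative learning. `query_labels` are assumed to be
    # indices into `classes` (the usual episodic relabeling).
    loss = None
    if query_labels is not None:
        loss = sum(w * F.cross_entropy(l, query_labels)
                   for l, w in zip(per_view, loss_weights))
    return fused_logits, loss
```
In this sketch, each element of `matchers` would stand in for one of the instance-specific, category-specific, or task-specific matching functions the abstract refers to; the actual M$^3$Net components are more elaborate than the cosine-prototype example shown here.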
Related papers
- VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding [9.048401253308123]
This paper investigates flexible organization and explicit correlation learning for multiple views.
We devise a nimble Transformer model, named \emph{VSFormer}, to explicitly capture pairwise and higher-order correlations of all elements in the set.
It reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD.
arXiv Detail & Related papers (2024-09-14T01:48:54Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Multi-interactive Feature Learning and a Full-time Multi-modality
Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z) - HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot
Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
arXiv Detail & Related papers (2023-01-09T13:32:50Z) - A Clustering-guided Contrastive Fusion for Multi-view Representation
Learning [7.630965478083513]
We propose a deep fusion network to fuse view-specific representations into the view-common representation.
We also design an asymmetrical contrastive strategy that aligns the view-common representation and each view-specific representation.
In the incomplete view scenario, our proposed method resists noise interference better than those of our competitors.
arXiv Detail & Related papers (2022-12-28T07:21:05Z) - Fast Multi-view Clustering via Ensembles: Towards Scalability,
Superiority, and Simplicity [63.85428043085567]
We propose a fast multi-view clustering via ensembles (FastMICE) approach.
The concept of random view groups is presented to capture the versatile view-wise relationships.
FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning.
arXiv Detail & Related papers (2022-03-22T09:51:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.