M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition
- URL: http://arxiv.org/abs/2308.03063v1
- Date: Sun, 6 Aug 2023 09:15:14 GMT
- Title: M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition
- Authors: Hao Tang, Jun Liu, Shuanglin Yan, Rui Yan, Zechao Li, Jinhui Tang
- Abstract summary: M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates \textit{multi-view encoding}, \textit{multi-view matching}, and \textit{multi-view fusion} to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
- Score: 80.21796574234287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the scarcity of manually annotated data required for fine-grained
video understanding, few-shot fine-grained (FS-FG) action recognition has
gained significant attention, with the aim of classifying novel fine-grained
action categories with only a few labeled instances. Despite the progress made
in FS coarse-grained action recognition, current approaches encounter two
challenges when dealing with the fine-grained action categories: the inability
to capture subtle action details and the insufficiency of learning from limited
data that exhibit high intra-class variance and inter-class similarity. To
address these limitations, we propose M$^3$Net, a matching-based framework for
FS-FG action recognition, which incorporates \textit{multi-view encoding},
\textit{multi-view matching}, and \textit{multi-view fusion} to facilitate
embedding encoding, similarity matching, and decision making across multiple
viewpoints. \textit{Multi-view encoding} captures rich contextual details from
the intra-frame, intra-video, and intra-episode perspectives, generating
customized higher-order embeddings for fine-grained data. \textit{Multi-view
matching} integrates various matching functions enabling flexible relation
modeling within limited samples to handle multi-scale spatio-temporal
variations by leveraging the instance-specific, category-specific, and
task-specific perspectives. \textit{Multi-view fusion} consists of
matching-predictions fusion and matching-losses fusion over the above views,
where the former promotes mutual complementarity and the latter enhances
embedding generalizability by employing multi-task collaborative learning.
Explainable visualizations and experimental results on three challenging
benchmarks demonstrate the superiority of M$^3$Net in capturing fine-grained
action details and achieving state-of-the-art performance for FS-FG action
recognition.
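The abstract's matching-predictions fusion can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' implementation: each "view" is assumed to yield its own embeddings of the class prototypes and the query, per-view cosine similarities are converted to class probabilities, and the distributions are averaged so the views complement one another. The function names (`cosine_sim`, `fuse_matching_predictions`) and the temperature parameter are illustrative assumptions.

```python
import numpy as np

def cosine_sim(q, protos):
    """Cosine similarity between a query vector and each class prototype."""
    q = q / np.linalg.norm(q)
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return protos @ q

def softmax(x, tau=0.1):
    """Temperature-scaled softmax over similarity scores."""
    z = x / tau
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fuse_matching_predictions(query_views, proto_views):
    """Average per-view class distributions (matching-predictions fusion)."""
    probs = [softmax(cosine_sim(q, p)) for q, p in zip(query_views, proto_views)]
    return np.mean(probs, axis=0)

# Toy 5-way episode with 3 views and 16-dim embeddings.
rng = np.random.default_rng(0)
n_way, dim, n_views = 5, 16, 3
proto_views = [rng.normal(size=(n_way, dim)) for _ in range(n_views)]
# Query drawn near class 2's prototype in every view.
query_views = [p[2] + 0.05 * rng.normal(size=dim) for p in proto_views]

fused = fuse_matching_predictions(query_views, proto_views)
print(fused.argmax())  # predicted class index
```

In the paper's terms, matching-losses fusion would additionally sum a per-view training loss over these same distributions, which is what enables the multi-task collaborative learning the abstract describes.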
Related papers
- Multi-View Factorizing and Disentangling: A Novel Framework for Incomplete Multi-View Multi-Label Classification [9.905528765058541]
We propose a novel framework for incomplete multi-view multi-label classification (iMvMLC)
Our method factorizes multi-view representations into two independent sets of factors: view-consistent and view-specific.
Our framework innovatively decomposes consistent representation learning into three key sub-objectives.
arXiv Detail & Related papers (2025-01-11T12:19:20Z)
- UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation [9.275683880295874]
Scene Graph Generation (SGG) aims at identifying object entities and reasoning their relationships within a given image.
One-stage methods integrate a fixed-size set of learnable queries to jointly reason relational triplets.
The challenge in one-stage methods stems from the issue of weak entanglement.
We introduce UniQ, a Unified decoder with task-specific queries architecture.
arXiv Detail & Related papers (2025-01-10T03:38:16Z)
- Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition [9.506482334842293]
Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task.
Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task.
We propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels.
arXiv Detail & Related papers (2024-07-17T05:42:43Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- A Clustering-guided Contrastive Fusion for Multi-view Representation Learning [7.630965478083513]
We propose a deep fusion network to fuse view-specific representations into the view-common representation.
We also design an asymmetrical contrastive strategy that aligns the view-common representation and each view-specific representation.
In the incomplete-view scenario, our proposed method resists noise interference better than competing methods.
arXiv Detail & Related papers (2022-12-28T07:21:05Z)
- Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity [63.85428043085567]
We propose a fast multi-view clustering via ensembles (FastMICE) approach.
The concept of random view groups is presented to capture the versatile view-wise relationships.
FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning.
arXiv Detail & Related papers (2022-03-22T09:51:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.