M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition
- URL: http://arxiv.org/abs/2308.03063v1
- Date: Sun, 6 Aug 2023 09:15:14 GMT
- Title: M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition
- Authors: Hao Tang, Jun Liu, Shuanglin Yan, Rui Yan, Zechao Li, Jinhui Tang
- Abstract summary: M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates \textit{multi-view encoding}, \textit{multi-view matching}, and \textit{multi-view fusion} to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
- Score: 80.21796574234287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the scarcity of manually annotated data required for fine-grained
video understanding, few-shot fine-grained (FS-FG) action recognition has
gained significant attention, with the aim of classifying novel fine-grained
action categories with only a few labeled instances. Despite the progress made
in FS coarse-grained action recognition, current approaches encounter two
challenges when dealing with the fine-grained action categories: the inability
to capture subtle action details and the insufficiency of learning from limited
data that exhibit high intra-class variance and inter-class similarity. To
address these limitations, we propose M$^3$Net, a matching-based framework for
FS-FG action recognition, which incorporates \textit{multi-view encoding},
\textit{multi-view matching}, and \textit{multi-view fusion} to facilitate
embedding encoding, similarity matching, and decision making across multiple
viewpoints. \textit{Multi-view encoding} captures rich contextual details from
the intra-frame, intra-video, and intra-episode perspectives, generating
customized higher-order embeddings for fine-grained data. \textit{Multi-view
matching} integrates various matching functions enabling flexible relation
modeling within limited samples to handle multi-scale spatio-temporal
variations by leveraging the instance-specific, category-specific, and
task-specific perspectives. \textit{Multi-view fusion} consists of
matching-predictions fusion and matching-losses fusion over the above views,
where the former promotes mutual complementarity and the latter enhances
embedding generalizability by employing multi-task collaborative learning.
Explainable visualizations and experimental results on three challenging
benchmarks demonstrate the superiority of M$^3$Net in capturing fine-grained
action details and achieving state-of-the-art performance for FS-FG action
recognition.
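The abstract's matching-predictions fusion can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' implementation: each "view" is assumed to yield its own embeddings of the class prototypes and the query, per-view cosine similarities are converted to class probabilities, and the distributions are averaged so the views complement one another. The function names (`cosine_sim`, `fuse_matching_predictions`) and the temperature parameter are illustrative assumptions.

```python
import numpy as np

def cosine_sim(q, protos):
    """Cosine similarity between a query vector and each class prototype."""
    q = q / np.linalg.norm(q)
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return protos @ q

def softmax(x, tau=0.1):
    """Temperature-scaled softmax over similarity scores."""
    z = x / tau
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fuse_matching_predictions(query_views, proto_views):
    """Average per-view class distributions (matching-predictions fusion)."""
    probs = [softmax(cosine_sim(q, p)) for q, p in zip(query_views, proto_views)]
    return np.mean(probs, axis=0)

# Toy 5-way episode with 3 views and 16-dim embeddings.
rng = np.random.default_rng(0)
n_way, dim, n_views = 5, 16, 3
proto_views = [rng.normal(size=(n_way, dim)) for _ in range(n_views)]
# Query drawn near class 2's prototype in every view.
query_views = [p[2] + 0.05 * rng.normal(size=dim) for p in proto_views]

fused = fuse_matching_predictions(query_views, proto_views)
print(fused.argmax())  # predicted class index
```

In the paper's terms, matching-losses fusion would additionally sum a per-view training loss over these same distributions, which is what enables the multi-task collaborative learning the abstract describes.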
Related papers
- Multi-View Factorizing and Disentangling: A Novel Framework for Incomplete Multi-View Multi-Label Classification [9.905528765058541]
We propose a novel framework for incomplete multi-view multi-label classification (iMvMLC)
Our method factorizes multi-view representations into two independent sets of factors: view-consistent and view-specific.
Our framework innovatively decomposes consistent representation learning into three key sub-objectives.
arXiv Detail & Related papers (2025-01-11T12:19:20Z)
- UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation [9.275683880295874]
Scene Graph Generation (SGG) aims at identifying object entities and reasoning their relationships within a given image.
One-stage methods integrate a fixed-size set of learnable queries to jointly reason relational triplets.
The challenge in one-stage methods stems from the issue of weak entanglement.
We introduce UniQ, a Unified decoder with task-specific queries architecture.
arXiv Detail & Related papers (2025-01-10T03:38:16Z)
- Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition [9.506482334842293]
Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task.
Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task.
We propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels.
arXiv Detail & Related papers (2024-07-17T05:42:43Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation [66.15246197473897]
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation.
We propose a Multi-interactive Feature learning architecture for image fusion and Segmentation.
arXiv Detail & Related papers (2023-08-04T01:03:58Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- A Clustering-guided Contrastive Fusion for Multi-view Representation Learning [7.630965478083513]
We propose a deep fusion network to fuse view-specific representations into the view-common representation.
We also design an asymmetrical contrastive strategy that aligns the view-common representation and each view-specific representation.
In the incomplete-view scenario, our proposed method resists noise interference better than competing methods.
arXiv Detail & Related papers (2022-12-28T07:21:05Z)
- Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity [63.85428043085567]
We propose a fast multi-view clustering via ensembles (FastMICE) approach.
The concept of random view groups is presented to capture the versatile view-wise relationships.
FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning.
arXiv Detail & Related papers (2022-03-22T09:51:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.