Related papers: View-aware Cross-modal Distillation for Multi-view Action Recognition

View-aware Cross-modal Distillation for Multi-view Action Recognition

URL: http://arxiv.org/abs/2511.12870v1
Date: Mon, 17 Nov 2025 02:00:22 GMT
Title: View-aware Cross-modal Distillation for Multi-view Action Recognition
Authors: Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide,
Abstract summary: We propose View-aware Cross-modal Knowledge Distillation (ViCoKD) to distill knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student.<n>ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities.<n>We also propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints.
Score: 7.312418283882337
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.

Related papers

Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect.<n>We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment.<n>Our RML is self-supervised and can also be applied for downstream tasks as a regularization.
arXiv Detail & Related papers (2025-03-06T07:01:08Z)
Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention [59.19580789952102]
This paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks.<n>MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization.<n>MUCA utilizes a Cross-Teacher-Student attention mechanism to guide the student network, guiding the student network to construct more discriminative feature representations.
arXiv Detail & Related papers (2025-01-18T11:57:20Z)
Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation [30.33381342502258]
Key challenge is unimodal bias, where multimodal segmentors over rely on certain modalities, causing performance drops when others are missing.<n>We develop the first framework for learning robust segmentor that can handle any combinations of visual modalities.
arXiv Detail & Related papers (2024-11-26T06:15:27Z)
Hierarchical Mutual Distillation for Multi-View Fusion: Learning from All Possible View Combinations [0.053801353100098995]
We propose a novel Multi-View Uncertainty-Weighted Mutual Distillation (MV-UWMD) method.<n>MV-UWMD improves prediction accuracy and consistency compared to existing multi-view learning approaches.
arXiv Detail & Related papers (2024-11-15T09:45:32Z)
Towards Generalized Multi-stage Clustering: Multi-view Self-distillation [10.368796552760571]
Existing multi-stage clustering methods independently learn the salient features from multiple views and then perform the clustering task. This paper proposes a novel multi-stage deep MVC framework where multi-view self-distillation (DistilMVC) is introduced to distill dark knowledge of label distribution.
arXiv Detail & Related papers (2023-10-29T03:35:34Z)
Hierarchical Mutual Information Analysis: Towards Multi-view Clustering in The Wild [9.380271109354474]
This work proposes a deep MVC framework where data recovery and alignment are fused in a hierarchically consistent way to maximize the mutual information among different views. To the best of our knowledge, this could be the first successful attempt to handle the missing and unaligned data problem separately with different learning paradigms.
arXiv Detail & Related papers (2023-10-28T06:43:57Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition. It incorporates textitmulti-view encoding, textitmulti-view matching, and textitmulti-view fusion to facilitate embedding encoding, similarity matching, and decision making. Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance. This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing the generalizationally crafted masks. Experiments on real-world datasets prove the significant effectiveness and ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z)
Contrastive Learning with Cross-Modal Knowledge Mining for Multimodal Human Activity Recognition [1.869225486385596]
We explore the hypothesis that leveraging multiple modalities can lead to better recognition. We extend a number of recent contrastive self-supervised approaches for the task of Human Activity Recognition. We propose a flexible, general-purpose framework for performing multimodal self-supervised learning.
arXiv Detail & Related papers (2022-05-20T10:39:16Z)
Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem. The proposed CAM detects the attention differences among multi-view, and adaptively integrates frame-level information to benefit each other. Experiments on four action datasets illustrate the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.