Fugu-MT Paper Translation (Abstract): View-aware Cross-modal Distillation for Multi-view Action Recognition

Paper overview: View-aware Cross-modal Distillation for Multi-view Action Recognition

arxiv url: http://arxiv.org/abs/2511.12870v1
Date: Mon, 17 Nov 2025 02:00:22 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:24.594958
Title: View-aware Cross-modal Distillation for Multi-view Action Recognition
Title (reference translation): View-aware Cross-modal Distillation for Multi-view Action Recognition
Authors: Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide
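Abstract summary: We propose View-aware Cross-modal Knowledge Distillation (ViCoKD), which distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. We also propose a View-aware Consistency module to address view misalignment.
Reference score (independently calculated attention score): 7.312418283882337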
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
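Abstract (reference translation): The spread of multi-sensor systems has stimulated research in multi-view action recognition. Existing approaches for multi-view setups with fully overlapping sensors benefit from consistent view coverage, whereas partially overlapping settings, in which an action is visible in only a subset of views, remain underexplored. The challenge becomes more severe in real-world scenarios, because many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. We also propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. The module enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and a confidence-weighted Jensen-Shannon divergence between the predicted class distributions. In experiments on the real-world MultiSensor-Home dataset, ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.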
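
The abstract describes the View-aware Consistency module only at a high level. As a rough illustration of the idea, the sketch below computes a confidence-weighted Jensen-Shannon divergence over pairs of views in which a person is detected in both. It is not the authors' implementation: the function names, the use of a per-view boolean flag in place of human-detection masks, and the choice of the product of maximum class probabilities as the confidence weight are all assumptions made for this example.

```python
import math
from itertools import combinations


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) for two discrete distributions (lists of floats)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def view_consistency_loss(probs, visible):
    """Hypothetical view-consistency loss (an assumption, not the paper's formula).

    probs:   one class-probability list per view.
    visible: one bool per view, standing in for a human-detection mask.
    Only pairs of views where the person is visible in both contribute.
    """
    total, weight_sum = 0.0, 0.0
    for i, j in combinations(range(len(probs)), 2):
        if not (visible[i] and visible[j]):
            continue  # action is not co-visible in this pair of views
        # Assumed confidence weight: product of each view's top probability.
        w = max(probs[i]) * max(probs[j])
        total += w * js_divergence(probs[i], probs[j])
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
    visible = [True, True, False]  # the third view does not see the person
    print(view_consistency_loss(probs, visible))
```

Because the third view is flagged as not visible, only the first two views are compared, so a view that cannot see the action does not pull the other predictions toward its own.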

Related papers