Fugu-MT paper translation (abstract): SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

Paper overview: SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

arxiv url: http://arxiv.org/abs/2606.17615v1
Date: Tue, 16 Jun 2026 07:19:20 GMT
Status: translation complete
Last updated in system: 2026-06-17 17:15:32.32879
Title: SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation
Authors: Edoardo Bianchi, Antonio Liotta
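Abstract summary: SkillMoV is a unified framework for multi-scenario proficiency estimation from synchronized video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP). SkillMoV is evaluated on EgoExo4D across six skill domains and three separately trained view configurations.
Reference score (attention measure computed independently by this site): 1.3893859937118993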
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
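The four MoVP stages listed in the abstract can be illustrated with a small NumPy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract fixes the twelve expert MLPs and the order of the stages, but the feature dimension, hidden width, number of views, number of proficiency classes, mean pooling over views, the residual connection, and the exact form of the gate are all hypothetical choices made here for illustration. The names `init_params` and `movp_forward` are likewise invented for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, dim, n_experts=12, hidden=32, n_classes=4):
    """Random weights. Only n_experts=12 comes from the abstract;
    every other size is an assumption made for this sketch."""
    s = 0.1
    return {
        "router": rng.normal(0, s, (dim, n_experts)),
        "w1": rng.normal(0, s, (n_experts, dim, hidden)),
        "w2": rng.normal(0, s, (n_experts, hidden, dim)),
        "wq": rng.normal(0, s, (dim, dim)),
        "wk": rng.normal(0, s, (dim, dim)),
        "wv": rng.normal(0, s, (dim, dim)),
        "prototypes": rng.normal(0, s, (n_classes, dim)),
        "w_gate": rng.normal(0, s, (2 * dim, dim)),
        "w_out": rng.normal(0, s, (dim, dim)),
    }


def movp_forward(views, p):
    """views: (V, D) array, one feature vector per synchronized camera."""
    dim = views.shape[1]

    # (i) Soft routing: every view gets a distribution over the experts,
    # computed from its features alone (no camera-identity input).
    route = softmax(views @ p["router"])                       # (V, E)
    hidden = np.maximum(
        np.einsum("vd,edh->veh", views, p["w1"]), 0.0)          # (V, E, H)
    expert_out = np.einsum("veh,ehd->ved", hidden, p["w2"])     # (V, E, D)
    routed = np.einsum("ve,ved->vd", route, expert_out)         # (V, D)

    # (ii) Cross-view attention: each view attends to all views.
    q, k, v = routed @ p["wq"], routed @ p["wk"], routed @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(dim))                      # (V, V)
    aligned = routed + attn @ v      # residual connection (assumption)
    pooled = aligned.mean(axis=0)    # mean pooling over views (assumption)

    # (iii) Prototype anchoring: soft-assign the pooled feature to
    # learnable class-level reference vectors.
    proto_w = softmax(p["prototypes"] @ pooled)                 # (C,)
    anchor = proto_w @ p["prototypes"]                          # (D,)

    # (iv) Prototype-conditioned gated projection (assumed form):
    # a sigmoid gate mixes the projected feature with the anchor.
    gate_in = np.concatenate([pooled, anchor])
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ p["w_gate"])))       # (D,)
    embedding = gate * (pooled @ p["w_out"]) + (1.0 - gate) * anchor
    return embedding, route


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, dim=16)
    views = rng.normal(size=(4, 16))   # 4 hypothetical camera views
    emb, route = movp_forward(views, params)
    print(emb.shape, route.shape)
```

Each row of `route` sums to one, so different views can weight the twelve experts differently without any camera label being supplied, which is the property the abstract attributes to the soft router.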
