Fugu-MT Paper Translation (Abstract): ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Paper overview: ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

arxiv url: http://arxiv.org/abs/2509.26278v1
Date: Tue, 30 Sep 2025 14:00:41 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.563307
Title: ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Title (reference translation): ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
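Abstract summary: This paper proposes ProfVLM, a compact vision-language model that reformulates skill proficiency estimation as generative reasoning. It jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos.
Reference score (independently computed attention score): 3.115853870709636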
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
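Abstract (reference translation): Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignore multi-view context, and lack explainability. This paper proposes ProfVLM, a compact vision-language model that reformulates the task as generative reasoning. At the core of the method is an AttentiveGatedProjector that dynamically fuses multi-view features from a frozen TimeSformer backbone and projects them into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and cutting training time by up to 60%. The approach not only achieves superior accuracy across diverse activities but also outputs natural-language critiques aligned with performance, providing transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.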
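
The abstract names an AttentiveGatedProjector that dynamically fuses multi-view features but gives no architectural details. The following is only a minimal pure-Python sketch of one generic way to combine attention weighting with gating over per-view feature vectors. It is not the authors' implementation. The function name `attentive_gated_fuse`, the parameters `score_w` and `gate_w`, and the toy feature vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attentive_gated_fuse(views, score_w, gate_w):
    """Fuse per-view feature vectors into a single vector.

    views:   list of equal-length feature vectors, one per camera view
    score_w: hypothetical weight vector used to score each view
    gate_w:  hypothetical per-dimension gate weights
    (Illustrative only; the paper's projector may differ substantially.)
    """
    # Attention: one scalar score per view, normalized across views.
    scores = [sum(w * x for w, x in zip(score_w, v)) for v in views]
    attn = softmax(scores)
    # Attention-weighted sum of the views, dimension by dimension.
    dim = len(views[0])
    fused = [sum(a * v[i] for a, v in zip(attn, views)) for i in range(dim)]
    # Gate: scale each fused dimension by a sigmoid factor in (0, 1).
    gated = [sigmoid(g * f) * f for g, f in zip(gate_w, fused)]
    return attn, gated


# Toy features standing in for one egocentric and one exocentric view.
ego = [1.0, 0.0, 2.0]
exo = [0.0, 1.0, 2.0]
attn, fused = attentive_gated_fuse(
    [ego, exo], score_w=[0.0, 0.0, 0.0], gate_w=[0.0, 0.0, 0.0]
)
```

With all-zero weights, both views receive equal attention and every gate sits at 0.5, so the output is simply half of the average of the two views. In a trained model the weights would be learned, letting the fusion favor whichever view is more informative for a given clip.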


Related papers