Fugu-MT Paper Translation (Abstract): Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

Paper overview: Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

arxiv url: http://arxiv.org/abs/2606.11930v2
Date: Thu, 11 Jun 2026 09:34:52 GMT
Status: Translation complete
Last updated in system: 2026-06-12 13:39:59.682404
Title: Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability
Authors: Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang
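Abstract summary: This paper describes a solution for the ACM Multimedia AVI Challenge 2026. Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels. The system uses frozen multimodal encoders: CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations.
Reference score (attention score computed independently by this site): 0.20999222360659606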
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
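The 19.1% figure quoted in the abstract can be checked directly from the reported validation MSE values. The short Python sketch below does that arithmetic for each ablation step; the function name `relative_reduction` and the dictionary layout are illustrative choices made here, not anything taken from the paper.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline` (positive means lower error)."""
    return (baseline - value) / baseline


# Track 1 validation MSE figures as quoted in the abstract.
OFFICIAL_BASELINE_MSE = 0.3334
ABLATION_MSE = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

# Relative MSE reduction of each ablation step against the official baseline.
REDUCTIONS = {
    name: relative_reduction(OFFICIAL_BASELINE_MSE, mse)
    for name, mse in ABLATION_MSE.items()
}

if __name__ == "__main__":
    for name, reduction in REDUCTIONS.items():
        print(f"{name}: MSE={ABLATION_MSE[name]:.4f}, reduction vs. baseline={reduction:.1%}")
```

The final step gives (0.3334 - 0.2696) / 0.3334, which rounds to the 19.1% stated in the abstract.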
