Fugu-MT Paper Translation (Summary): VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

Paper summary: VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

arxiv url: http://arxiv.org/abs/2605.18547v1
Date: Mon, 18 May 2026 15:27:10 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.910371
Title: VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
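Title (reference translation): VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
Authors: Linan ZHU, Zihao Zhai, Xiao Han, Yuqian Fu, Xiangfan Chen, Xiangjie Kong, Guojiang Shen
Abstract summary: Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction. Recent Vision-Language Models (VLMs) are not inherently tailored for ERC. We propose VISAFF, a speaker-centered VISual AFFective feature learning framework.
Reference score (independently calculated attention level): 17.099995082943735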
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. VISAFF utilizes a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker's emotional visual cues without heavy training overheads. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets demonstrate that VISAFF achieves highly competitive performance compared to state-of-the-art methods in a tuning-free setting, significantly enhancing computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.
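Abstract (reference translation): Emotion Recognition in Conversation (ERC), which aims to identify speakers' emotional states in multi-turn dialogues, is essential for effective human-machine interaction. Early text-based methods struggle with complex scenarios such as sarcasm because they inherently neglect important non-verbal information. Recent Vision-Language Models (VLMs) address this by analyzing video directly, but they are not inherently suited to ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. In addition, isolated visual signals are often ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. VISAFF adopts a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker's emotional visual cues. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages the textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets show that VISAFF achieves highly competitive performance compared with state-of-the-art methods in a tuning-free setting and significantly improves computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.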
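
The abstract does not say how the reliability-guided complementation is computed, so the following is only an illustrative sketch of the general idea: when the visual prediction is uncertain, lean more on the textual and acoustic predictions. It assumes each modality already yields a probability distribution over emotion classes and uses normalized entropy as a stand-in reliability score. The function names `visual_reliability` and `complement`, the entropy-based score, and the equal weighting of text and audio are all assumptions made for this example, not the authors' method.

```python
import numpy as np

def visual_reliability(visual_probs):
    """Hypothetical reliability score: 1 minus normalized entropy.

    Returns a value near 1 for a confident (peaked) distribution
    and near 0 for a uniform one.
    """
    p = np.clip(np.asarray(visual_probs, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum()
    score = 1.0 - entropy / np.log(len(p))
    return float(np.clip(score, 0.0, 1.0))

def complement(visual_probs, text_probs, audio_probs):
    """Blend visual predictions with text/audio, weighted by reliability.

    Assumption: text and audio contribute equally to the fallback signal.
    """
    visual = np.asarray(visual_probs, dtype=float)
    support = 0.5 * (np.asarray(text_probs, dtype=float)
                     + np.asarray(audio_probs, dtype=float))
    r = visual_reliability(visual)
    fused = r * visual + (1.0 - r) * support
    return fused / fused.sum()  # renormalize to a probability distribution
```

With a uniform visual distribution the reliability is approximately zero, so the fused output falls back to the average of the text and audio distributions; with a sharply peaked visual distribution the visual prediction dominates.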

Related papers