Fugu-MT Paper Translation (Abstract): Why Do Vision Language Models Struggle To Recognize Human Emotions?

Paper overview: Why Do Vision Language Models Struggle To Recognize Human Emotions?

arxiv url: http://arxiv.org/abs/2604.15280v1
Date: Thu, 16 Apr 2026 17:49:58 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:32.036998
Title: Why Do Vision Language Models Struggle To Recognize Human Emotions?
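Title (reference translation): Why Do Vision Language Models Struggle To Recognize Human Emotions?
Authors: Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh
Abstract summary: The paper shows that vision-language models (VLMs) struggle to recognize human emotions. The dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. The paper proposes alternative sampling strategies that prevent favoring common concepts.
Reference score (independently computed attention score): 16.54642537638597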
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
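Abstract (reference translation): Understanding emotions is a fundamental ability that allows intelligent systems to interact with humans. Vision-language models (VLMs) have made major progress on many visual tasks over the past few years and may offer a promising solution for understanding emotions. Surprisingly, however, even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform specialized vision-only classifiers. This paper asks "Why do VLMs struggle to recognize human emotions?" and observes that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs worsens this head-class bias, causing rare, under-represented emotions to collapse systematically into common categories. The paper proposes alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. VLMs, however, are limited by context size and by the number of tokens that fit in memory, so they cannot represent temporal information over dense frame sequences, which poses a clear challenge for emotion recognition. The paper demonstrates that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, it proposes a multi-stage context enrichment strategy that uses the information in the "in-between" frames by first converting them into natural language summaries. This enriched textual context is given to the VLM as input alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.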
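The two vulnerabilities named in the abstract can be illustrated with a small Python sketch. It is not the authors' method: the abstract does not specify their sampling strategies, so inverse-frequency weighting is substituted here as one common way to avoid favoring head classes. The clip length, frame count, event timings, and emotion labels are hypothetical values chosen only for illustration, and all function names are invented for this sketch.

```python
from collections import Counter


def uniform_frame_times(duration_s, num_frames):
    """Timestamps (seconds) of num_frames samples taken at the midpoint
    of equal-length segments of a clip lasting duration_s seconds."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def event_is_sampled(event_start_s, event_len_s, frame_times):
    """True if at least one sampled timestamp falls inside the event."""
    event_end_s = event_start_s + event_len_s
    return any(event_start_s <= t <= event_end_s for t in frame_times)


def inverse_frequency_weights(labels):
    """Per-example sampling weights proportional to 1 / class count, so every
    class receives the same total weight (an assumed stand-in technique,
    not the strategy proposed in the paper)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical setting: an 8-second clip reduced to 8 sparse frames.
times = uniform_frame_times(duration_s=8.0, num_frames=8)
gap_s = times[1] - times[0]
print("gap between sampled frames (s):", gap_s)

# A 0.25 s micro-expression starting at t = 2.0 s falls between two samples.
print("micro-expression seen:", event_is_sampled(2.0, 0.25, times))

# Hypothetical long-tailed label set: 8 'happy' examples and 2 'fear' examples.
labels = ["happy"] * 8 + ["fear"] * 2
weights = inverse_frequency_weights(labels)
print("weight per 'happy' example:", weights[0])
print("weight per 'fear' example:", weights[-1])
```

With a sampling gap longer than the 0.25-0.5 second duration of a micro-expression, an event can start and end between two sampled frames and never reach the model. The weights returned by `inverse_frequency_weights` could be passed to Python's `random.choices(examples, weights=weights, k=...)` to draw rare and common emotions with equal total probability.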

Related papers