Fugu-MT Paper Translation (Abstract): CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition

Paper abstract: CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition

arxiv url: http://arxiv.org/abs/2603.27999v1
Date: Mon, 30 Mar 2026 03:39:42 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.216556
Title: CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition
Title (reference translation): CLIP-AUTT: Test-Time Personalization via Action Unit Prompts for Fine-Grained Video Emotion Recognition
Authors: Muhammad Osama Zeeshan, Masoumeh Sharafi, Benoît Savary, Alessandro Lameiras Koerich, Marco Pedersoli, Eric Granger
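Abstract summary: Action Units (AUs) serve as structured textual prompts within CLIP to model fine-grained facial expressions. The authors introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. They also propose CLIP-AUTT, a video-based test-time personalization method.
Reference score (independently computed attention score): 57.8548595493709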
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.
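Abstract (reference translation): Personalization in emotion recognition (ER) is essential for accurately interpreting subtle, subject-specific expressive patterns. Recent advances in vision-language models (VLMs) such as CLIP show strong potential for exploiting joint image-text representations in ER. However, CLIP-based methods depend either on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions and provide localized, interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. By aligning AU prompts with facial dynamics, it learns generic, subject-agnostic representations and enables fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Extensive experiments on the video-based subtle ER datasets BioVid, StressID, and BAH indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER.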
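
Illustrative note: the abstract mentions entropy-guided temporal window selection but gives no algorithmic details. The sketch below is one plausible reading, not the authors' implementation. It assumes that per-frame class logits are already available (for example, from image-text similarity scores), that windows slide over the video with a fixed stride, and that windows with the lowest mean prediction entropy are the ones kept for adaptation. The function names, the window size, the stride, and the "keep lowest entropy" rule are all hypothetical. The prompt-tuning step itself is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def window_entropies(frame_logits, window_size, stride):
    """Return (start_index, mean_entropy) for each sliding window of frames."""
    frame_h = [entropy(softmax(row)) for row in frame_logits]
    scored = []
    for start in range(0, len(frame_h) - window_size + 1, stride):
        chunk = frame_h[start:start + window_size]
        scored.append((start, sum(chunk) / window_size))
    return scored


def select_confident_windows(frame_logits, window_size=4, stride=2, keep=1):
    """Assumption: keep the `keep` windows with the lowest mean entropy."""
    scored = window_entropies(frame_logits, window_size, stride)
    scored.sort(key=lambda pair: pair[1])
    return [start for start, _ in scored[:keep]]


if __name__ == "__main__":
    # Toy video: four near-uniform (uncertain) frames, then four confident frames.
    toy_logits = [[0.1, 0.0]] * 4 + [[5.0, 0.0]] * 4
    print(select_confident_windows(toy_logits, window_size=4, stride=2, keep=1))
```

In this toy input the window starting at frame 4 contains only confident frames, so it is the one selected. A real system would need to choose the window length, stride, and number of retained windows empirically.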

Related papers