Fugu-MT paper translation (abstract): Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond

Paper overview: Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond

arxiv url: http://arxiv.org/abs/2511.00389v1
Date: Sat, 01 Nov 2025 03:53:00 GMT
Status: Translation complete
System update date: 2025-11-05 16:37:26.751794
Title: Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond
Title (reference translation): Rethinking Facial Expression Recognition in Multimodal Large Language Models: Benchmarks, Datasets, and More
Authors: Fan Zhang, Haoxuan Li, Shengju Qian, Xin Wang, Zheng Lian, Hao Wu, Zhihong Zhu, Yuan Gao, Qiankun Li, Yefeng Zheng, Zhouchen Lin, Pheng-Ann Heng
Abstract summary: We propose post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. We develop a unified and interpretable FER foundation model called UniFER-7B.
Reference score (independently computed attention score): 116.65158801881984
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality and large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR), respectively. Building upon them, we develop a unified and interpretable FER foundation model termed UniFER-7B, which outperforms many open-sourced and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).
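Abstract (reference translation): Multimodal Large Language Models (MLLMs) have revolutionized many research fields, including computer vision and affective computing. As an important challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models toward more unified approaches. One promising way to unify FER tasks is to convert conventional FER datasets into visual question-answering (VQA) formats, so that powerful generalist MLLMs can be applied directly for inference. However, despite the success of state-of-the-art MLLMs on various tasks, their performance on FER tasks remains unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. The results show that although MLLMs achieve good classification performance, they have significant limitations in reasoning and interpretability. We therefore propose post-training strategies aimed at improving the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality, large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards. Building on these, we construct a unified and interpretable FER foundation model called UniFER-7B, which outperforms open-source and closed-source MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).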


Related papers