Fugu-MT Paper Translation (Abstract): LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps

Paper overview: LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps

arxiv url: http://arxiv.org/abs/2604.27345v2
Date: Fri, 01 May 2026 01:59:42 GMT
Status: Translation complete
Last updated in system: 2026-05-04 13:37:10.932841
Title: LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps
Title (reference translation): LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps
Authors: Keito Inoshita, Xiaokang Zhou, Akira Kawai, Katsutoshi Yada
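Abstract summary: Human annotators often disagree on emotion labels. Most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard. The paper proposes three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%.
Reference score (attention score computed independently by this site): 10.159901538172575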
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human-LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
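Abstract (reference translation): Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard and discard the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing the emotion judgment distributions of human annotators, four zero-shot LLMs, and a RoBERTa baseline on two complementary benchmarks, GoEmotions and EmoBank. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions that require contextual inference. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.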
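The kind of distributional comparison the abstract describes can be illustrated with a minimal sketch. The abstract names neither its gap metric nor its three calibration methods, so everything here is an assumption made for illustration: Jensen-Shannon divergence stands in for the gap measure, temperature scaling stands in for a post-hoc calibration step, and the function names, smoothing value, and toy labels are hypothetical.

```python
import math
from collections import Counter


def label_distribution(labels, categories, smoothing=0.1):
    """Convert a list of labels into a smoothed probability distribution.

    `smoothing` is a hypothetical additive constant that keeps every
    category strictly positive so that logarithms stay defined.
    """
    counts = Counter(labels)
    total = len(labels) + smoothing * len(categories)
    return [(counts[c] + smoothing) / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 means identical distributions)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def temperature_scale(p, temperature):
    """Flatten (temperature > 1) or sharpen (temperature < 1) a distribution."""
    logits = [math.log(a) / temperature for a in p]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy data (invented for illustration): five human annotators disagree,
# while five sampled LLM responses all return the majority label.
categories = ["joy", "anger", "neutral"]
human = label_distribution(["joy", "joy", "neutral", "anger", "joy"], categories)
llm = label_distribution(["joy"] * 5, categories)

gap_raw = js_divergence(human, llm)
gap_calibrated = js_divergence(human, temperature_scale(llm, temperature=2.0))
print(f"raw gap: {gap_raw:.4f}, calibrated gap: {gap_calibrated:.4f}")
```

In this toy case the LLM distribution is overconfident, so flattening it moves it closer to the human distribution. In practice a temperature would be fitted on held-out human annotations rather than fixed by hand.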

Related papers

Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking [0.03219669032304847]
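Human label variation (HLV) remains underexplored in benchmarking despite the rapid development of large language models. This study proposes an evaluation protocol for multimodal large language model benchmarking that explicitly accounts for two conditions: (1) label agreement and (2) disagreement.
Reference translation of paper (metadata) (2026-03-20T08:29:05Z)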
VoxEmo: Benchmarking Speech Emotion Recognition with Speech LLMs [54.75016325571445]
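Speech large language models (LLMs) show great potential for speech emotion recognition through generative interfaces. The shift from closed-set classification to open text generation introduces zero-shot behavior and makes evaluation highly sensitive to prompts. This paper introduces VoxEmo, a comprehensive SER benchmark for speech LLMs that comprises 35 emotion corpora across 15 languages.
Reference translation of paper (metadata) (2026-03-09T21:10:34Z)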
Are Multimodal Large Language Models Good Annotators for Image Tagging? [62.01475514488922]
This paper analyzes the gap between MLLM-generated annotations and human annotations. It proposes TagLLM, a new framework for image tagging that aims to narrow that gap.
Reference translation of paper (metadata) (2026-02-24T14:53:16Z)
Scaling Ambiguity: Augmenting Human Annotation in Speech Emotion Recognition with Audio-Language Models [14.458242760193203]
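Speech emotion recognition models typically use a single categorical label, which obscures the inherent ambiguity of human emotion. This paper examines whether Large Audio-Language Models (ALMs) can ease the annotation bottleneck by generating high-quality synthetic annotations. It proposes a framework that uses ALMs to create Synthetic Perceptual Proxies.
Reference translation of paper (metadata) (2026-01-21T03:32:24Z)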
Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection [5.731621080995591]
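Hate speech spreads widely online and harms individuals and communities, which makes automatic detection essential for moderation at scale. Part of the problem lies in subjectivity: what one person flags as hate speech, another may see as benign. Large language models (LLMs) promise scalable annotation, but prior work has shown that they cannot fully replace human judgment.
Reference translation of paper (metadata) (2025-12-10T14:00:48Z)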
Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier [53.55996102181836]
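This paper proposes the Emotional Rationale Verifier (ERV) and an Explanation Reward. The method guides the model toward reasoning that is explicitly consistent with the target emotion. The approach not only improves consistency between explanations and predictions but also helps MLLMs deliver emotionally coherent and trustworthy interactions.
Reference translation of paper (metadata) (2025-10-27T16:40:17Z)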
Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
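Variation in human annotation (i.e., disagreement) is common in NLP. The paper evaluates how different reasoning settings affect the ability of large language models to model disagreement. Surprisingly, RLVR-style reasoning degrades performance in disagreement modeling.
Reference translation of paper (metadata) (2025-06-24T09:49:26Z)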
Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
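The paper proposes a method that uses large language models (LLMs) to rescale ordinal annotations together with their explanations. Annotators' Likert ratings and the corresponding explanations are fed to an LLM, which produces a numeric score anchored in a scoring rubric. The method rescales raw judgments without affecting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
Reference translation of paper (metadata) (2023-05-24T06:19:14Z)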

The related papers list is generated automatically from the titles and abstracts of papers on this site.
