Fugu-MT Paper Translation (Abstract): Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

Paper overview: Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

arxiv url: http://arxiv.org/abs/2605.17079v1
Date: Sat, 16 May 2026 16:55:31 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:47.599069
Title: Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
Authors: Tianyu Wang, Jiajun Li, Jianghao Lin,
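Abstract summary: The benchmark is built from 1,553 Chinese social-media topics and 23,122 atomic, rule-audited criteria. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points.
Reference score (attention score computed by the site): 19.108409101323605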
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
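The pointwise scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the three judges' decisions are aggregated, so the majority vote, the definition of agreement as unanimity, and every name and example criterion below are assumptions made purely for illustration.

```python
def majority(votes):
    """Return True when more than half of the yes/no votes are 'yes'."""
    return sum(votes) * 2 > len(votes)


def coverage(judgements):
    """Fraction of criteria whose majority verdict is 'yes'.

    `judgements` maps a criterion id to a list of boolean judge votes.
    Majority voting is an assumption, not a detail taken from the paper.
    """
    if not judgements:
        return 0.0
    covered = sum(majority(votes) for votes in judgements.values())
    return covered / len(judgements)


def unanimous_agreement(judgements):
    """Fraction of criteria on which every judge gave the same vote.

    Treating 'agreement' as unanimity is likewise an assumption.
    """
    if not judgements:
        return 0.0
    agreed = sum(len(set(votes)) == 1 for votes in judgements.values())
    return agreed / len(judgements)


# Hypothetical votes from three judges on four made-up reaction criteria.
example = {
    "mentions_price": [True, True, True],
    "questions_quality": [True, False, True],
    "compares_competitor": [False, False, False],
    "jokes_about_packaging": [False, True, False],
}

print(coverage(example))             # share of criteria judged covered
print(unanimous_agreement(example))  # share of criteria with unanimous votes
```

Under these assumptions, a reported figure such as 47.8% coverage would correspond to `coverage` returning 0.478 over the criteria attached to the evaluated topics.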
