Fugu-MT Paper Translation (Summary): Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Paper summary: Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

arxiv url: http://arxiv.org/abs/2604.22517v1
Date: Fri, 24 Apr 2026 12:56:54 GMT
Status: Translation complete
Last updated in system: 2026-04-27 15:36:26.459632
Title: Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
Title (reference translation): Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
Authors: Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Takuto Asakura, Chung-Chi Chen, Tatsuya Ishigaki
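Abstract summary: Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection. The paper then compares three judge configurations: a zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with the similarity of judge-generated reasoning only under personalized conditioning.
Reference score (attention measure computed independently by this site): 4.814048071575166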
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
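Abstract (reference translation): Evaluating business ideas generated by LLMs is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation depends on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments frequently disagree. This paper examines the methodological question raised by that disagreement: should an automatic judge approximate an aggregate consensus, or should it model evaluators individually? PBIG-DATA is a dataset of approximately 3,000 individual scores for 300 patent-grounded product ideas, given by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, which points to structured heterogeneity rather than random noise. Three judge configurations are then compared: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with the similarity of judge-generated reasoning only under personalized conditioning. These results suggest that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.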
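
Illustrative sketch (not from the paper): the three judge configurations named in the abstract differ in which scoring history, if any, is placed in the judge's context. The Python below shows one way that difference could be expressed. The names `ScoredExample` and `build_judge_prompt`, the prompt wording, the integer score scale, and the example limit `k` are all hypothetical; the abstract does not specify the actual prompts or selection procedure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScoredExample:
    """One past score by one evaluator (hypothetical record layout)."""
    evaluator_id: str
    idea: str
    dimension: str
    score: int  # assumed integer ordinal scale


def build_judge_prompt(
    config: str,
    rubric: str,
    idea: str,
    dimension: str,
    history: List[ScoredExample],
    target_evaluator: Optional[str] = None,
    k: int = 3,
) -> str:
    """Assemble a judge prompt for one of three configurations.

    zero_shot    -> rubric only, no scored examples
    aggregate    -> examples drawn from all evaluators' histories
    personalized -> examples drawn only from the target evaluator's history
    """
    same_dimension = [ex for ex in history if ex.dimension == dimension]
    if config == "zero_shot":
        examples = []
    elif config == "aggregate":
        examples = same_dimension[:k]
    elif config == "personalized":
        if target_evaluator is None:
            raise ValueError("personalized config needs target_evaluator")
        examples = [
            ex for ex in same_dimension if ex.evaluator_id == target_evaluator
        ][:k]
    else:
        raise ValueError(f"unknown config: {config}")

    lines = [f"Rubric: {rubric}"]
    for ex in examples:
        lines.append(f"Example idea: {ex.idea} -> {ex.dimension} score: {ex.score}")
    lines.append(f"Idea to score: {idea}")
    lines.append(f"Dimension: {dimension}")
    return "\n".join(lines)


# Toy history with two evaluators (made-up values for illustration only).
history = [
    ScoredExample("e1", "idea A", "market size", 4),
    ScoredExample("e2", "idea B", "market size", 2),
    ScoredExample("e1", "idea C", "market size", 5),
    ScoredExample("e2", "idea D", "specificity", 3),
]

zero_shot = build_judge_prompt("zero_shot", "rubric text", "idea X", "market size", history)
aggregate = build_judge_prompt("aggregate", "rubric text", "idea X", "market size", history)
personalized = build_judge_prompt(
    "personalized", "rubric text", "idea X", "market size", history, target_evaluator="e2"
)
```

In this sketch the only difference between the aggregate and personalized prompts is the filter on `evaluator_id`, which mirrors the abstract's contrast between mixed evaluator histories and the target evaluator's own history.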

Related papers