Fugu-MT Paper Translation (Abstract): Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Paper overview: Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

arxiv url: http://arxiv.org/abs/2606.12608v1
Date: Wed, 10 Jun 2026 19:04:09 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.420557
Title: Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants
Authors: Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin
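Abstract summary: No existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping Reasoning Bench is an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts.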
Reference score (independently computed attention score): 24.455456910655254
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57--77% overall. On multi-turn missions, all models score 13--29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4--18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
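The abstract describes importance-weighted binary rubrics and reports pass rates separately for required and optional criteria, but it does not state how per-criterion results are aggregated. The following is a minimal Python sketch of one plausible aggregation, a weight-normalized pass rate. The `Rubric` class, its field names, the `weighted_pass_rate` function, and the sample weights are all hypothetical illustrations, not the authors' implementation or data.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    """One binary criterion (hypothetical structure, not from the paper)."""
    weight: float          # assumed importance weight
    passed: bool           # binary judgment: criterion met or not
    required: bool = True  # False for optional "above-and-beyond" criteria


def weighted_pass_rate(rubrics):
    """Share of total importance weight carried by passed criteria.

    Assumed aggregation: sum of passed weights / sum of all weights.
    Returns 0.0 for an empty or zero-weight list.
    """
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    return sum(r.weight for r in rubrics if r.passed) / total


# Made-up rubric results for a single response (illustrative only).
example = [
    Rubric(weight=3.0, passed=True),
    Rubric(weight=2.0, passed=False),
    Rubric(weight=1.0, passed=True, required=False),
    Rubric(weight=1.0, passed=False, required=False),
]

overall = weighted_pass_rate(example)
required_only = weighted_pass_rate([r for r in example if r.required])
optional_only = weighted_pass_rate([r for r in example if not r.required])

print(f"overall={overall:.3f} required={required_only:.3f} optional={optional_only:.3f}")
```

Under this assumed scheme, a gap between required and optional criteria like the one reported in the abstract would be the difference between `required_only` and `optional_only`, averaged over missions.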


Related papers