Fugu-MT Paper Translation (Abstract): Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Paper overview: Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

arxiv url: http://arxiv.org/abs/2604.07054v1
Date: Wed, 08 Apr 2026 13:06:37 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.54416
Title: Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Title (reference translation): Benchmarking the Realistic Selling Skill of LLMs
Authors: Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, Leo Huang,
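Abstract summary: SalesLLM is a benchmark derived from realistic applications covering Financial Services and Consumer Goods. SalesLLM scores correlate strongly with expert human ratings. It serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.
Reference score (independently computed attention score): 1.1559341355776336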
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performance LLMs are competitive with human-level performance while the less capable ones are worse than human. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.
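Abstract (reference translation): Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a difficult setting for large language models (LLMs). However, existing dialogue benchmarks rarely measure deal progression and outcomes. The authors introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. They propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, they train a user model, CustomerLM, with SFT and DPO on 8,000 sales conversations involving crowdworkers, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performing LLMs are competitive with human-level performance, while less capable ones perform worse than humans. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.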
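
The abstract reports agreement between benchmark scores and expert ratings as a Pearson correlation (r=0.98). As a minimal sketch of how such a figure is computed, the snippet below implements the standard Pearson formula in plain Python. The function name `pearson_r` and the two score lists are hypothetical and invented for illustration; they are not the paper's data or code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model values, invented for illustration only.
benchmark_scores = [0.42, 0.55, 0.61, 0.70, 0.83]  # automatic benchmark score
human_ratings = [2.1, 2.9, 3.0, 3.6, 4.4]          # expert rating

r = pearson_r(benchmark_scores, human_ratings)
print(f"Pearson r = {r:.3f}")
```

A value near 1 means the automatic scores rank and space the models almost the same way the expert raters do. It does not by itself show that the two scales agree in absolute terms.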

Related papers