Fugu-MT Paper Translation (Abstract): SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

Paper overview: SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

arxiv url: http://arxiv.org/abs/2605.08334v1
Date: Fri, 08 May 2026 17:59:23 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.575879
Title: SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators
Title (reference translation): SalesSim: Benchmarking and Aligning Multimodal Language Models as User Simulators
Authors: Yada Pruksachatkun, Elaine Wan, Lyanna Chen, Kai-Wei Chang, Chien-Sheng Wu
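Abstract summary: This paper introduces SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior. We measure the consistency between the simulator's actions and its persona specifications, as well as conversational quality. Experiments show that UserGRPO boosts the decision alignment of the baseline model by 13.8% while improving conversational quality.
Reference score (independently computed attention score): 63.68151307455963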
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator's actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking 6 open- and closed-source state-of-the-art models. First, while models produce fluent conversations, they display significantly lower lexical diversity than human conversations and overdisclose criteria across personas. Second, models tend to be persuaded by sales agent suggestions and drift from persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe to optimize both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the adherence of user simulators in goal-oriented settings.
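Abstract (reference translation): We propose SalesSim, a framework and testbed for evaluating how well Multimodal Large Language Models (MLLMs) simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike earlier work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process in which shoppers with diverse backgrounds, preferences, and dealbreakers talk with a sales agent, ask for clarification, and make informed purchasing decisions. For evaluation, we design a set of metrics centered on decision alignment, which measures the consistency between the simulator's actions and its persona specifications, together with conversational quality. Benchmarking 6 open- and closed-source state-of-the-art models reveals several behavioral gaps. First, although the models produce fluent conversations, their lexical diversity is significantly lower than in human conversations, and they overdisclose criteria across personas. Second, the models tend to be persuaded by the sales agent's suggestions and to drift from their persona specifications. Even the strongest model reaches less than 79% average alignment with its underlying persona specifications. To address these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning method that optimizes both conversational fluency and decision alignment under persona specifications. Experiments show that UserGRPO boosts the decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for investigating and improving the adherence of user simulators in goal-oriented settings.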

Related papers