Fugu-MT Paper Translation (Abstract): Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

Paper overview: Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

arxiv url: http://arxiv.org/abs/2606.08867v2
Date: Sat, 13 Jun 2026 05:21:25 GMT
Status: Translation complete
Last updated in system: 2026-06-16 18:36:04.814884
Title: Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework
Title (reference translation): Building Customer Support AI Agents at 100-Million-User Scale - An Evaluation-Driven Framework
Authors: Aman Gupta, Kevin Rossell, Edesio Alcobaça, Jose Chrystian Lima Pacheco, Carolina Baptista de Lima, Shao Tang, Luiz Paulo Rabachini, Luis Moneda, Herbert Fei, Daniel Silva, Rohan Ramanath
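Abstract summary: We present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank. A central insight is that evaluation-pipeline quality directly determines iteration velocity. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.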
Reference score (independently computed attention score): 2.29541210878158
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.
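Abstract (reference translation): With the rapid advance of LLM capabilities, AI agents are becoming increasingly viable across a wide range of tasks. One of the most promising applications is building production-ready, customer-facing agents, a challenge that requires coordinated excellence in evaluation methodology, context engineering, training, and online measurement. However, these critical pillars are usually developed independently, creating blind spots that surface only after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with more than 100 million users. The approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In the card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, together with a strong correlation between offline simulation metrics and online outcomes, showing that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.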

Related papers