Fugu-MT Paper Translation (Summary): AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

Paper summary: AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

arxiv url: http://arxiv.org/abs/2605.24183v1
Date: Fri, 22 May 2026 20:16:41 GMT
Status: Translation complete
Updated in system: 2026-05-26 19:50:17.728558
Title: AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery
Title (reference translation): AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery
Authors: Darek Kleczek, Fuheng Zhao, Alexander W. Lee, Julien Tissier, Pawel Liskowski, Ugur Cetintemel, Anupam Datta
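Abstract summary: AvalancheBench is a benchmark for evaluating enterprise data agents through latent world recovery. It evaluates analytical understanding rather than pipeline completion. It generates observations from a known latent world, enabling partial credit for incomplete but valid recoveries.
Reference score (independently computed attention score): 36.56945581753333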
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.
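Abstract (reference translation): This paper introduces AvalancheBench, a benchmark for evaluating enterprise data agents. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, which enables partial credit for incomplete but valid recoveries. Third, it reveals how early analytical mistakes propagate into later conclusions: missed segments, merged events, and wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.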
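
The abstract describes scoring against a rubric with partial credit but does not state the formula. One plausible reading, sketched below purely as an illustration, is a weighted fraction of ground-truth rubric items that the agent recovers. The names `RubricItem` and `rubric_recovery`, the exact-name matching rule, and the example items are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One element of the known latent world (hypothetical structure)."""
    kind: str          # e.g. "segment", "driver", "temporal_event", "relationship"
    name: str
    weight: float = 1.0


def rubric_recovery(ground_truth, recovered_names):
    """Return the weighted fraction of rubric items the agent recovered.

    Assumption: an item counts as recovered when its name appears in
    `recovered_names`. A real grader would likely need fuzzier matching.
    """
    total = sum(item.weight for item in ground_truth)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in ground_truth
                 if item.name in recovered_names)
    return earned / total


# Made-up latent world with four equally weighted items.
world = [
    RubricItem("segment", "bargain_hunters"),
    RubricItem("driver", "shipping_cost"),
    RubricItem("temporal_event", "spring_promo"),
    RubricItem("relationship", "promo_affects_bargain_hunters"),
]

# An agent that finds only one item still earns partial credit.
score = rubric_recovery(world, {"shipping_cost"})
print(f"recovered {score:.0%} of the rubric")
```

Under this reading, an incomplete analysis is not scored as a total failure, and the per-kind breakdown of missed items would show where recovery fell short.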

Related papers