Fugu-MT Paper Translation (Abstract): SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Paper overview: SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

arxiv url: http://arxiv.org/abs/2605.22564v1
Date: Thu, 21 May 2026 14:45:02 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.307285
Title: SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
Title (reference translation): SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
Authors: Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti
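Abstract summary: SynAE is an evaluation framework for assessing how well synthetic benchmarks for tool-calling agents replicate and augment the characteristics of real data trajectories. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes.
Reference score (independently computed attention measure): 18.71623023651951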
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.
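Abstract (reference translation): Tool-calling agents are today commonly evaluated or tested on static datasets of execution traces, which include input commands, agent responses, and the associated tool calls. Internal production datasets, however, are often insufficient or unusable for testing: they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially before deployment). In these settings, practitioners increasingly replace or augment real datasets with synthetic ones for evaluation. A key challenge is quantifying the relationship between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity, and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, which motivates a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.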


Related papers