Fugu-MT Paper Translation (Abstract): DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

Paper overview: DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

arxiv url: http://arxiv.org/abs/2601.16344v1
Date: Thu, 22 Jan 2026 22:03:29 GMT
Status: Translation complete
Last updated in system: 2026-01-26 14:27:27.420889
Title: DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
Title (reference translation): DSGym: A holistic framework for the evaluation and training of data science agents
Authors: Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou
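Abstract summary: DSGym is a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live testbed. We build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks.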
Reference score (independently computed attention score): 38.72287521565312
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage, and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in literature and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.
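Abstract (reference translation): Data science agents promise to accelerate discovery and insight generation by converting data into executable analyses and findings. However, existing data science benchmarks fall short: their fragmented evaluation interfaces make cross-benchmark comparison difficult, their task coverage is narrow, and they lack rigorous data grounding. In particular, we show that a considerable portion of the tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a comprehensive task suite that standardizes and refines existing benchmarks through quality filtering and shortcut-solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in the literature, and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training through an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.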
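
The abstract mentions "shortcut solvability filtering" without describing how it is done. The following is only a minimal sketch of one way such a filter could work, not DSGym's actual implementation: a task is dropped when a solver that never sees the data files still reproduces the reference answer. The names `Task`, `normalize`, `filter_shortcut_solvable`, and `blind_guess`, the exact-match comparison, and the `attempts` parameter are all hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record: a question and its reference answer."""
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.strip().lower().split())


def filter_shortcut_solvable(
    tasks: List[Task],
    answer_without_data: Callable[[str], str],
    attempts: int = 3,
) -> List[Task]:
    """Keep only tasks that the data-blind solver fails on in every attempt.

    `answer_without_data` stands in for an agent that receives the question
    text but no access to the dataset (an assumption of this sketch).
    """
    kept = []
    for task in tasks:
        solved_blind = any(
            normalize(answer_without_data(task.question)) == normalize(task.answer)
            for _ in range(attempts)
        )
        if not solved_blind:
            kept.append(task)
    return kept


def blind_guess(question: str) -> str:
    """Toy data-blind solver: answers from prior knowledge only."""
    if "capital of France" in question:
        return "Paris"
    return "unknown"


if __name__ == "__main__":
    demo_tasks = [
        Task("What is the capital of France?", "paris"),
        Task("What is the mean of column 'age' in data.csv?", "41.7"),
    ]
    for task in filter_shortcut_solvable(demo_tasks, blind_guess):
        print(task.question)
```

In this toy run the first task is removed because it can be answered from prior knowledge alone, while the second is kept because answering it requires reading the data. A real filter would presumably use a language model as the blind solver and a more tolerant answer comparison.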

Related papers