Fugu-MT Paper Translation (Abstract): 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

Paper overview: 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

arxiv url: http://arxiv.org/abs/2605.17046v2
Date: Tue, 19 May 2026 07:26:38 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:08.476214
Title: 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
Title (reference translation): 1GC-7RC: One Graphic Card - Seven Research Challenges!
Authors: Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald,
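Abstract summary: **1GC-7RC** is a benchmark of seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides locked data-preparation and evaluation scripts together with a baseline training script. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub.
Reference score (attention score computed by this site): 7.781391987352844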
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.
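Abstract (reference translation): Autonomous AI coding agents are becoming a core tool for ML practitioners in both industry and research. Despite this growing adoption, there is no standard benchmark for evaluating their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark of seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides locked data-preparation and evaluation scripts together with a baseline training script. The agent may modify only the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), has no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5 and Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because the benchmark design is modular, it can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings.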
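
Illustrative note: the abstract says each task must finish within a task-specific wall-clock budget (40-120 minutes). The sketch below shows one way such a limit could be enforced around a training command. It is not taken from the 1GC-7RC harness. The function name `run_with_budget`, its parameters, and the example command are hypothetical and chosen only for illustration.

```python
import subprocess
import sys
import time


def run_with_budget(cmd, budget_seconds):
    """Run `cmd` as a child process under a wall-clock budget (hypothetical helper).

    Returns (finished, elapsed_seconds). `finished` is False when the
    budget expired, in which case subprocess.run kills the child.
    """
    start = time.monotonic()
    try:
        # subprocess.run raises TimeoutExpired after killing the child
        # if it is still running once `timeout` seconds have passed.
        subprocess.run(cmd, timeout=budget_seconds, check=False)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in for a training script; a real run would pass something like
    # [sys.executable, "train.py"] with a budget of 40 * 60 seconds or more.
    ok, elapsed = run_with_budget([sys.executable, "-c", "print('training done')"], 30)
    print(f"finished={ok} elapsed={elapsed:.2f}s")
```

A monotonic clock is used for the elapsed time so that system clock adjustments during a long run do not distort the measurement.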

Related papers