Fugu-MT Paper Translation (Abstract): AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans

Paper overview: AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans

arxiv url: http://arxiv.org/abs/2509.21891v1
Date: Fri, 26 Sep 2025 05:28:22 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.200446
Title: AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans
Authors: Yangtian Zi, Zixuan Wu, Aleksander Boruch-Gruszecki, Jonathan Bell, Arjun Guha
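Abstract summary: Fine-tuning large language models for code editing has typically relied on mining commits and pull requests. We present AgentPack, a corpus of 1.3M code edits co-authored by Claude Code, OpenAI Codex, and Cursor Agent. We show that models fine-tuned on AgentPack can outperform models trained on prior human-only commit corpora.
Reference score (attention score computed independently by the site): 46.56091965723774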
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Fine-tuning large language models for code editing has typically relied on mining commits and pull requests. The working hypothesis has been that commit messages describe human intent in natural language, and patches to code describe the changes that implement that intent. However, much of the previously collected data is noisy: commit messages are terse, human-written commits commingle several unrelated edits, and many commits come from simple, rule-based bots. The recent adoption of software engineering agents changes this landscape. Code changes co-authored by humans and agents tend to be more narrowly scoped and focused on clearer goals. Their commit messages, generated by LLMs, articulate intent and rationale in much greater detail. Moreover, when these changes land in public repositories, they are implicitly filtered by humans: maintainers discard low-quality commits to their projects. We present AgentPack, a corpus of 1.3M code edits co-authored by Claude Code, OpenAI Codex, and Cursor Agent across public GitHub projects up to mid-August 2025. We describe the identification and curation pipeline, quantify adoption trends of these agents, and analyze the structural properties of the edits. Finally, we show that models fine-tuned on AgentPack can outperform models trained on prior human-only commit corpora, highlighting the potential of using public data from software engineering agents to train future code-editing models.
