Fugu-MT Paper Translation (Abstract): Are Coding Agents Generating Over-Mocked Tests? An Empirical Study

Paper overview: Are Coding Agents Generating Over-Mocked Tests? An Empirical Study

arxiv url: http://arxiv.org/abs/2602.00409v1
Date: Fri, 30 Jan 2026 23:55:25 GMT
Status: Translation complete
System update date: 2026-02-10 14:57:52.421216
Title: Are Coding Agents Generating Over-Mocked Tests? An Empirical Study
Authors: Andre Hora, Romain Robbes
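Abstract summary: Coding agents have recently seen significant adoption in software development. This paper investigates the presence of mocks in agent-generated tests of real-world software systems. Overall, coding agents are more likely than non-agents to modify tests and to add mocks to tests.
Reference score (independently calculated attention score): 2.3625700564650347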
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Coding agents have received significant adoption in software development recently. Unlike traditional LLM-based code completion tools, coding agents work with autonomy (e.g., invoking external tools) and leave visible traces in software repositories, such as authoring commits. Among their tasks, coding agents may autonomously generate software tests; however, the quality of these tests remains uncertain. In particular, excessive use of mocking can make tests harder to understand and maintain. This paper presents the first study to investigate the presence of mocks in agent-generated tests of real-world software systems. We analyzed over 1.2 million commits made in 2025 in 2,168 TypeScript, JavaScript, and Python repositories, including 48,563 commits by coding agents, 169,361 commits that modify tests, and 44,900 commits that add mocks to tests. Overall, we find that coding agents are more likely to modify tests and to add mocks to tests than non-coding agents. We detect that (1) 60% of the repositories with agent activity also contain agent test activity; (2) 23% of commits made by coding agents add/change test files, compared with 13% by non-agents; (3) 68% of the repositories with agent test activity also contain agent mock activity; (4) 36% of commits made by coding agents add mocks to tests, compared with 26% by non-agents; and (5) repositories created recently contain a higher proportion of test and mock commits made by agents. Finally, we conclude by discussing implications for developers and researchers. We call attention to the fact that tests with mocks may be potentially easier to generate automatically (but less effective at validating real interactions), and the need to include guidance on mocking practices in agent configuration files.
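To make the abstract's concern concrete, the following is a minimal Python sketch of why a heavily mocked test can be "less effective at validating real interactions". It is not taken from the paper: `PriceParser`, `Cart`, and both test functions are hypothetical names invented for illustration, and the bug in `PriceParser.parse` is deliberate.

```python
from unittest.mock import Mock


class PriceParser:
    """Hypothetical collaborator with a deliberate bug:
    it strips the dollar sign but not thousands separators."""

    def parse(self, text: str) -> float:
        return float(text.replace("$", ""))  # float("1,000") raises ValueError


class Cart:
    """Hypothetical unit under test that delegates parsing to a collaborator."""

    def __init__(self, parser):
        self.parser = parser

    def total(self, prices: list[str]) -> float:
        return sum(self.parser.parse(p) for p in prices)


def over_mocked_test() -> bool:
    """The parser is replaced by a mock, so the real parsing code never runs."""
    parser = Mock()
    parser.parse.return_value = 1000.0
    return Cart(parser).total(["$1,000"]) == 1000.0


def real_interaction_test() -> bool:
    """The real parser is used, so the bug surfaces as a failing test."""
    try:
        return Cart(PriceParser()).total(["$1,000"]) == 1000.0
    except ValueError:
        return False


if __name__ == "__main__":
    print("over-mocked test passes:", over_mocked_test())
    print("real-interaction test passes:", real_interaction_test())
```

In this sketch the mocked test passes while the test that exercises the real collaborator fails, which is the kind of gap the abstract warns about. Whether a given mock is excessive depends on the project; the example only shows the mechanism.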

Related papers

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
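We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct a three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in most runs.
Paper reference translation (metadata) (2025-10-13T22:22:28Z)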
Intention-Driven Generation of Project-Specific Test Cases [45.2380093475221]
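We propose IntentionTest, which generates project-specific tests from a description of the validation intention. We extensively evaluate IntentionTest against state-of-the-art baselines (DA, ChatTester, and EvoSuite) on 4,146 test cases from 13 open-source projects.
Paper reference translation (metadata) (2025-07-28T08:35:04Z)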
SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [49.73885480071402]
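We introduce SWE-PolyBench, a new benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729), and Python (199), covering bug fixes, feature additions, and code refactoring. Experiments show that current agents perform unevenly across languages and struggle with complex problems while performing better on simpler tasks.
Paper reference translation (metadata) (2025-04-11T17:08:02Z)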
CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
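We present CLOVER, a benchmark for evaluating models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks.
Paper reference translation (metadata) (2025-02-12T21:42:56Z)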
Commit0: Library Generation from Scratch [77.38414688148006]
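Commit0 is a benchmark that challenges AI agents to write libraries from scratch. Agents are given a specification document outlining the library's API and an interactive unit test suite. Commit0 also offers an interactive environment in which models receive static analysis and execution feedback on the code they generate.
Paper reference translation (metadata) (2024-12-02T18:11:30Z)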
Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios [13.949319911378826]
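This study evaluated 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues. No single agent dominated, and 170 issues remained unresolved, indicating room for improvement. Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities. While some agents increased code complexity, many reduced code duplication and minimized code smells.
Paper reference translation (metadata) (2024-10-16T11:33:57Z)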
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
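We investigate LLM-based code agents for formalizing user issues into test cases. We propose a new benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug fixes, and golden tests. Code agents designed for code repair outperform systems designed specifically for test generation.
Paper reference translation (metadata) (2024-06-18T14:54:37Z)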
Observation-based unit test generation at Meta [52.4716552057909]
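TestGen automatically generates unit tests carved from serialized observations of complex objects observed during application execution. TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults. The evaluation shows that, working from 4,361 reliable end-to-end tests, it was able to generate tests for at least 86% of the classes.
Paper reference translation (metadata) (2024-02-09T00:34:39Z)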

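The related papers list is generated automatically from the titles and abstracts of the papers on this site.

This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from the use of this site (including all information and translations).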