Fugu-MT Paper Translation (Abstract): SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Paper overview: SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

arxiv url: http://arxiv.org/abs/2603.16124v1
Date: Tue, 17 Mar 2026 05:12:48 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.105313
Title: SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Title (reference translation): SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Authors: Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen
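Abstract summary: SWE-QA-Pro is a benchmark constructed from diverse, long-tail repositories with executable environments. The authors also propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). A Qwen3-8B model trained with this recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art models.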
Reference score (independently computed attention score): 41.98672557723593
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
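Abstract (reference translation): Agentic repository-level code understanding is essential for automating complex software engineering tasks, but the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, the paper introduces SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. Topical balance is enforced via issue-driven clustering to cover under-represented task types, and a rigorous difficulty calibration process is applied: questions solvable by direct-answer baselines are filtered out. The result is a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. To tackle the scarcity of training data for such complex behaviors, the paper also proposes a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. A Qwen3-8B model trained with the recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of the evaluation and the effectiveness of the agentic training workflow.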

Related papers