Fugu-MT Paper Translation (Abstract): Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

Paper abstract: Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

arxiv url: http://arxiv.org/abs/2606.10933v1
Date: Tue, 09 Jun 2026 14:44:43 GMT
Status: Translation complete
System last updated: 2026-06-10 15:40:58.556528
Title: Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages
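Title (reference translation): Frontier coding agents use metaprogramming to adapt to unfamiliar programming languages
Authors: Aman Sharma, Sushrut Thorat, Paras Chopra
Abstract summary: We evaluate six contemporary coding agents on four esoteric programming languages. The strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly.
Reference score (attention score computed independently by this site): 4.779196219827507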
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language's rules.
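Abstract (reference translation): LLM-based coding agents are usually evaluated in familiar software settings such as mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We therefore evaluate six contemporary coding agents on four esoteric programming languages, using a sequential setup with file editing, local execution, and hidden-test grading. The protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. The strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from the strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, which contains no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap lies in constructing and debugging a strategy that works under the target language's rules.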
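
The metaprogramming strategy described in the abstract (writing Python that generates target-language code, then checking that generator locally) can be illustrated with a minimal sketch. This is not code from the paper: the function names `bf_print` and `run_bf` are hypothetical, the generator only handles printing a fixed string of characters with code points below 256, and the interpreter assumes 8-bit wrapping cells, a 30000-cell tape, balanced brackets, and no input instruction.

```python
def bf_print(text: str) -> str:
    """Generate Brainfuck that prints `text` using a single cell.

    Hypothetical helper for illustration: it moves the cell value from one
    character code to the next with runs of '+' or '-', then emits '.'.
    """
    parts = []
    current = 0  # value the generated program will hold in its cell
    for ch in text:
        target = ord(ch)
        delta = target - current
        parts.append(("+" if delta > 0 else "-") * abs(delta))
        parts.append(".")
        current = target
    return "".join(parts)


def run_bf(code: str, max_steps: int = 1_000_000) -> str:
    """Minimal Brainfuck interpreter (no ',' input) for local checks.

    Assumptions: 8-bit wrapping cells, 30000-cell tape, balanced brackets.
    """
    # Precompute matching bracket positions so loops jump in O(1).
    stack, jump = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i

    tape = [0] * 30000
    ptr = pc = steps = 0
    out = []
    while pc < len(code):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        c = code[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = jump[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jump[pc]  # repeat the loop body
        pc += 1
    return "".join(out)


if __name__ == "__main__":
    program = bf_print("Hi!")
    # Local feedback loop: run the generated code and compare with the goal.
    assert run_bf(program) == "Hi!"
    print(len(program), "Brainfuck instructions generated")
```

The point of the sketch is the workflow rather than the generator itself: the Brainfuck text is never written by hand, and any bug is fixed in the Python generator and re-verified against the local interpreter.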

Related papers