Fugu-MT paper translation (abstract): AutoHarness: improving LLM agents by automatically synthesizing a code harness

Paper abstract: AutoHarness: improving LLM agents by automatically synthesizing a code harness

arxiv url: http://arxiv.org/abs/2603.03329v1
Date: Tue, 10 Feb 2026 14:12:54 GMT
Status: translation complete
System update date: 2026-03-09 01:20:08.157631
Title: AutoHarness: improving LLM agents by automatically synthesizing a code harness
Title (reference translation): AutoHarness: improving LLM agents through automatic synthesis of a code harness
Authors: Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy
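Abstract summary: In the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. This paper demonstrates that Gemini-2.5-Flash can automatically synthesize a code harness that prevents such moves. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games.
Reference score (independently computed attention score): 12.769239134972269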
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
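Abstract (reference translation): Although language models have advanced considerably in the last few years, when used as agents they often attempt actions that are not merely suboptimal for a given state but strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. To prevent such failures, people often hand-write "harnesses" around LLMs. This paper demonstrates that Gemini-2.5-Flash can automatically synthesize such a code harness using a small number of iterative code-refinement rounds driven by feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), allowing the smaller Gemini-2.5-Flash model to outperform larger models such as Gemini-2.5-Pro. Pushed to the limit, the technique lets Gemini-2.5-Flash generate the entire policy in code, eliminating the need to call the LLM at decision-making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. The results show that using a smaller model to synthesize a custom code harness (or an entire policy) can outperform a much larger model while also being more cost-effective.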
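
The abstract does not spell out what a harness looks like, so the following is only a minimal sketch of the general idea: a wrapper that checks each proposed action against a legality test, feeds a rejection message back to the proposer, and falls back to a known-legal action. Everything here is an assumption made for illustration: the tic-tac-toe board, the names `is_legal_action`, `harnessed_action`, and `make_scripted_proposer`, and the retry-then-fallback strategy. In the paper the harness code is synthesized by Gemini-2.5-Flash from environment feedback; in this sketch it is hand-written, and the LLM is replaced by a scripted proposer.

```python
from typing import Callable, List, Optional

Board = List[str]  # 9 cells, each "X", "O", or " " (toy game, assumed for illustration)


def is_legal_action(board: Board, action: int) -> bool:
    """Hand-written legality check; a stand-in for synthesized harness code."""
    return 0 <= action < len(board) and board[action] == " "


def harnessed_action(
    board: Board,
    propose: Callable[[Board, Optional[str]], int],
    max_attempts: int = 3,
) -> int:
    """Ask the proposer for a move, rejecting illegal ones and re-asking with feedback."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        action = propose(board, feedback)
        if is_legal_action(board, action):
            return action
        feedback = f"Action {action} is illegal on this board; choose an empty cell."
    # Fallback: return the first legal cell so the wrapper never emits an illegal move.
    for cell in range(len(board)):
        if is_legal_action(board, cell):
            return cell
    raise ValueError("no legal actions available")


def make_scripted_proposer(moves: List[int]) -> Callable[[Board, Optional[str]], int]:
    """Hypothetical stand-in for an LLM: replays a fixed list of proposed moves."""
    remaining = iter(moves)

    def propose(board: Board, feedback: Optional[str]) -> int:
        return next(remaining)

    return propose


if __name__ == "__main__":
    board = ["X", "O", " ", " ", " ", " ", " ", " ", " "]
    # The first two proposals hit occupied cells; the third is accepted.
    print(harnessed_action(board, make_scripted_proposer([0, 1, 4])))
```

The fallback branch is what lets a wrapper of this kind guarantee that no illegal move reaches the environment, at the cost of occasionally substituting an arbitrary legal move for the proposer's choice.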

Related papers