Fugu-MT Paper Translation (Abstract): LemonHarness Technical Report

Paper overview: LemonHarness Technical Report

arxiv url: http://arxiv.org/abs/2606.24311v1
Date: Tue, 23 Jun 2026 08:44:00 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.853083
Title: LemonHarness Technical Report
Authors: Kailong Ren, Fubo Sun, Jiachen Liu, Liu Yang, Zimo Yin, Jiaying Li, Congli Yin, Ming He, Yu Huo, Jiawei Liu, Zeping Chen, Yubin Huangfu, Ronghua Li, Yixuan Wu, Xing Su, Yanzhi Xu, Likang Wu, Hongke Zhao, Lei Zhang, Xiaohui Geng, Jianping Fan
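Abstract summary: LemonHarness is an integrated execution framework for long-horizon agents. It constrains state-changing operations within a clearly defined workspace and brings model invocation, tool execution, and rule knowledge within a single controlled boundary.
Reference score (independently computed attention metric): 40.68992799867636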
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language model (LLM) agents are applied to longer tasks, they increasingly modify workspace state across multiple rounds of iteration. However, agents typically observe only tool outputs and log fragments, while the actual state changes occur in the file system. Without explicit workspace boundaries, state-changing operations such as file writes and temporary artifact generation may scatter changes across paths. Over time, these weakly constrained changes accumulate, making states such as modified files difficult to track. This paper presents LemonHarness, an integrated execution framework for long-horizon agents. LemonHarness establishes an explicit execution boundary by constraining state-changing operations within a clearly defined workspace and bringing model invocation, tool execution, and rule knowledge within a single controlled boundary. State-changing operations, including file writes, dependency installation, and temporary artifact creation, are executed through structured tool interfaces, with execution feedback recorded as observations available to subsequent model decisions. The system also introduces a reusable rule knowledge base, which turns recurring execution rules and acceptance criteria into runtime knowledge. LemonHarness further adds a time-aware execution mechanism that exposes elapsed and remaining budget to the model, so it can rebalance exploration, implementation, and validation effort as time pressure shifts and avoid timeouts from long waits or excessive verification. On Terminal-Bench 2.0, LemonHarness_GPT-5.3-CodeX reached 84.49% accuracy over 445 trials; pairing the same framework with the stronger GPT-5.5 backbone raised the average accuracy to 86.52% across five jobs. The results suggest that a unified runtime boundary, callable rule knowledge, and time-aware execution can improve the stability of long-horizon agent execution.
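The abstract describes two mechanisms that are easy to picture in code: state-changing operations confined to a defined workspace, with feedback recorded as observations, and a time budget whose elapsed and remaining amounts are exposed to the model. The following is a minimal sketch of that idea only, not the LemonHarness implementation. The class `BoundedWorkspace`, the exception `WorkspaceError`, the method names, and the shape of the observation records are all hypothetical and chosen for illustration.

```python
import time
from pathlib import Path


class WorkspaceError(Exception):
    """Raised when a tool call targets a path outside the workspace."""


class BoundedWorkspace:
    """Hypothetical tool layer: writes must stay under `root`, and every
    call is logged as an observation together with the time budget."""

    def __init__(self, root, budget_seconds, clock=time.monotonic):
        self.root = Path(root).resolve()
        self.budget_seconds = budget_seconds
        self._clock = clock
        self._start = clock()
        self.observations = []  # feedback available to later model decisions

    def budget(self):
        """Return elapsed and remaining seconds of the overall budget."""
        elapsed = self._clock() - self._start
        remaining = max(0.0, self.budget_seconds - elapsed)
        return {"elapsed": elapsed, "remaining": remaining}

    def _record(self, tool, path, ok):
        # Attach the current budget so the model sees time pressure per step.
        self.observations.append(
            {"tool": tool, "path": str(path), "ok": ok, **self.budget()}
        )

    def write_file(self, relative_path, content):
        """Write `content` under the workspace root, rejecting escapes."""
        target = (self.root / relative_path).resolve()
        if not target.is_relative_to(self.root):  # Python 3.9+
            self._record("write_file", relative_path, ok=False)
            raise WorkspaceError(f"{relative_path} is outside the workspace")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        self._record("write_file", relative_path, ok=True)
        return target
```

In this sketch a rejected write is still recorded, so the observation log reflects both accepted and refused state changes; whether LemonHarness does the same is not stated in the abstract.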
