Fugu-MT Paper Translation (Abstract): Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

Paper overview: Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

arxiv url: http://arxiv.org/abs/2604.27209v2
Date: Fri, 01 May 2026 15:02:38 GMT
Status: Translation complete
Last updated in system: 2026-05-04 13:37:10.931123
Title: Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Authors: Halley Young, Nikolaj Björner
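Abstract summary: Large language models can generate substantial code and research text, but research-software projects require more than either artifact alone. We identify two LM-specific failure modes: hallucination accumulation and desynchronization. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state.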
Reference score (independently computed attention measure): 1.0312968200748116
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models can now generate substantial code and draft research text, but research-software projects require more than either artifact alone. The mathematical thesis, executable system, benchmark surface, and public claims must mature together, yet often drift apart. We identify two LM-specific failure modes: hallucination accumulation, in which claims exceed what code or theory supports and unsupported assertions propagate across sessions; and desynchronization, in which code, theory, or the model's own world model fall out of alignment. We propose Comet-H, an iterative prompt automaton that orchestrates ideation, implementation, evaluation, grounding, and paper-writing as coupled coordinates of a single workspace state. At each step, a controller selects the next prompt by scoring it against what the workspace currently lacks, carries unfinished follow-up work forward with a half-life, and re-checks the paper and README against the code and benchmarks whenever documentation changes. We frame prompt selection as a small contextual bandit problem over prompt families, with prompts as arms, workspace deficits as context, and a hand-weighted linear score. This transparent scorer, paired with a fading record of unfinished work, bounds long-horizon follow-ups, requires no learned policy, and makes each prompt choice legible from the workspace. We created a portfolio of 46 research-software repositories across two dozen domains. We study A3 in depth, a Python static-analysis tool built entirely within the loop, which reaches F1 = 0.768 on a 90-case benchmark, compared with a next-best baseline of 0.364. Across approximately 400 commits, we find that audit-and-contraction passes dominate the later phases of every successful trajectory.
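The prompt-selection step described in the abstract (prompt families as arms, workspace deficits as context, a hand-weighted linear score, and unfinished follow-ups that fade with a half-life) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the family names, deficit features, weights, the half-life value, and the choice to add the decayed follow-up weight to the linear score are all assumptions introduced here.

```python
# Hypothetical prompt families (arms) and deficit features (context).
# None of these names or weights come from the paper.
WEIGHTS = {
    "ideate":     {"missing_theory": 1.0, "failing_tests": 0.0, "stale_docs": 0.0},
    "implement":  {"missing_theory": 0.2, "failing_tests": 1.0, "stale_docs": 0.0},
    "audit_docs": {"missing_theory": 0.0, "failing_tests": 0.3, "stale_docs": 1.0},
}

HALF_LIFE_STEPS = 3.0  # assumed: a follow-up loses half its weight every 3 steps


def decay(weight, steps_elapsed, half_life=HALF_LIFE_STEPS):
    """Exponentially fade a follow-up weight with the given half-life."""
    return weight * 0.5 ** (steps_elapsed / half_life)


def score(family, deficits, carryover):
    """Hand-weighted linear score plus any decayed follow-up bonus (assumed additive)."""
    linear = sum(w * deficits.get(name, 0.0) for name, w in WEIGHTS[family].items())
    return linear + carryover.get(family, 0.0)


def select_prompt(deficits, followups, step):
    """Pick the highest-scoring family.

    followups maps a family name to (initial_weight, step_created).
    """
    carryover = {
        family: decay(weight, step - created)
        for family, (weight, created) in followups.items()
    }
    return max(WEIGHTS, key=lambda family: score(family, deficits, carryover))


deficits = {"missing_theory": 0.1, "failing_tests": 0.8, "stale_docs": 0.2}
followups = {"audit_docs": (1.0, 0)}  # unfinished audit work recorded at step 0

fresh_choice = select_prompt(deficits, followups, step=0)
faded_choice = select_prompt(deficits, followups, step=6)
```

Under these assumed weights, the pending audit follow-up wins at step 0, but by step 6 its weight has decayed to a quarter of its initial value and the failing-tests deficit takes over. Because the score is a plain weighted sum, each choice can be traced back to specific deficits, which matches the legibility the abstract attributes to the scorer.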

Related papers