Fugu-MT Paper Translation (Abstract): Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

Paper abstract: Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

arxiv url: http://arxiv.org/abs/2604.18245v1
Date: Mon, 20 Apr 2026 13:25:40 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.898497
Title: Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols
Authors: Fernando Reitich
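Abstract summary: This work proposes a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit and a post-step correctness bit. The correction and corruption rates derived from these bits predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines.
Reference score (attention score computed independently by this site): 51.56484100374058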
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly deployed as protocols: structured multi-call procedures that spend additional computation to transform a baseline answer into a final one. These protocols are evaluated only by end-to-end accuracy, giving limited insight into when they help, when they hurt, and whether their behavior transfers under distribution shift or composition. We propose a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit $E_0\in\{0,1\}$ and a post-step correctness bit $E_1\in\{0,1\}$, separating correction ($E_0=0\to E_1=1$) from corruption ($E_0=1\to E_1=0$) through two rates: $c=\Pr(E_1=1\mid E_0=0)$ and $\gamma=\Pr(E_1=0\mid E_0=1)$. These rates predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines. We identify three failure mechanisms. Under mixture shift, pooled estimates of $(c,\gamma)$ become biased when calibration and deployment mixtures differ; conditioning on a difficulty proxy restores stability without additional model calls. Under presentation contamination, selection protocols alter the interface through stable presentation artifacts when candidate content is fixed. Under state insufficiency, the correctness bit may not carry enough history for multi-step pipelines to compose predictably; a Markov factorization test identifies when composition is valid and where additional state is needed. When a protocol step passes these diagnostics, it becomes an auditable module: gated by estimated gain, conditioned on a difficulty proxy to correct mixture bias, and composed into multi-step pipelines with predictable accuracy. We demonstrate these ideas on synthetic mathematical tasks and on GSM8K, where the calibrated interface correctly predicts when protocol steps should be activated or suppressed.
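
Illustration (not from the paper): the abstract states that the two rates predict accuracy changes. By the law of total probability, a baseline accuracy $a_0=\Pr(E_0=1)$ gives a post-step accuracy $a_1=a_0(1-\gamma)+(1-a_0)c$, so the gain is $a_1-a_0=(1-a_0)c-a_0\gamma$. The Python sketch below estimates $(c,\gamma)$ from paired bits and applies that identity. The names `TwoRates`, `estimate_rates`, `predicted_accuracy`, `predicted_gain`, and `should_activate` are hypothetical, and gating at a gain threshold of zero is an assumption; the abstract only says that steps are "gated by estimated gain". The sketch uses pooled estimates, which the abstract warns become biased under mixture shift.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TwoRates:
    c: float      # correction rate, Pr(E1 = 1 | E0 = 0)
    gamma: float  # corruption rate, Pr(E1 = 0 | E0 = 1)


def estimate_rates(pairs: Iterable[Tuple[int, int]]) -> TwoRates:
    """Pooled estimate of (c, gamma) from paired correctness bits (e0, e1)."""
    pairs = list(pairs)
    n_wrong = sum(1 for e0, _ in pairs if e0 == 0)
    n_right = sum(1 for e0, _ in pairs if e0 == 1)
    if n_wrong == 0 or n_right == 0:
        raise ValueError("need both baseline-correct and baseline-wrong instances")
    corrected = sum(1 for e0, e1 in pairs if e0 == 0 and e1 == 1)
    corrupted = sum(1 for e0, e1 in pairs if e0 == 1 and e1 == 0)
    return TwoRates(c=corrected / n_wrong, gamma=corrupted / n_right)


def predicted_accuracy(a0: float, rates: TwoRates) -> float:
    """Post-step accuracy a1 = a0 * (1 - gamma) + (1 - a0) * c."""
    return a0 * (1.0 - rates.gamma) + (1.0 - a0) * rates.c


def predicted_gain(a0: float, rates: TwoRates) -> float:
    """Accuracy change a1 - a0 = (1 - a0) * c - a0 * gamma."""
    return (1.0 - a0) * rates.c - a0 * rates.gamma


def should_activate(a0: float, rates: TwoRates) -> bool:
    """Assumed gating rule: run the step only if the predicted gain is positive."""
    return predicted_gain(a0, rates) > 0.0


if __name__ == "__main__":
    # Toy calibration set (made up for illustration): 4 baseline-wrong
    # instances, 2 of them corrected; 6 baseline-right, 1 of them corrupted.
    calibration = [(0, 1)] * 2 + [(0, 0)] * 2 + [(1, 1)] * 5 + [(1, 0)] * 1
    rates = estimate_rates(calibration)
    for a0 in (0.6, 0.9):
        print(a0, round(predicted_gain(a0, rates), 4), should_activate(a0, rates))
```

With these toy counts the same step is predicted to help at a baseline accuracy of 0.6 and to hurt at 0.9, because corruption acts on the larger share of already-correct answers. This only holds if the calibrated rates carry over to deployment, which is the assumption the paper's diagnostics are meant to check.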
