Fugu-MT Paper Translation (Abstract): Nonstandard Errors in AI Agents

Paper summary: Nonstandard Errors in AI Agents

arxiv url: http://arxiv.org/abs/2603.16744v1
Date: Tue, 17 Mar 2026 16:21:22 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.413319
Title: Nonstandard Errors in AI Agents
Title (reference translation): Nonstandard Errors in AI Agents
Authors: Ruijiang Gao, Steven Chong Xiao
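Abstract summary: We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. We find that AI agents exhibit sizable nonstandard errors (NSEs), that is, uncertainty arising from agent-to-agent variation in analytical choices. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.
Reference score (independently computed attention score): 6.890249567932368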
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015--2024), we find that AI agents exhibit sizable \textit{nonstandard errors} (NSEs), that is, uncertainty from agent-to-agent variation in analytical choices, analogous to those documented among human researchers. AI agents diverge substantially on measure choice (e.g., autocorrelation vs.\ variance ratio, dollar vs.\ share volume). Different model families (Sonnet 4.6 vs.\ Opus 4.6) exhibit stable ``empirical styles,'' reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80--99\% within \textit{converging} measure families. Convergence occurs both through within-family estimation tightening and through agents switching measure families entirely, but convergence reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.
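Abstract (reference translation): We study whether state-of-the-art AI coding agents, given the same data and research question, produce the same empirical results. Deploying 150 autonomous Claude Code agents to independently test six hypotheses about market quality trends in NYSE TAQ data for SPY (2015-2024), we find that AI agents exhibit sizable nonstandard errors (NSEs). AI agents diverge substantially on measure choice (e.g., autocorrelation vs. variance ratio, dollar vs. share volume). Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable "empirical styles," reflecting systematic differences in methodological preferences. In a three-stage feedback protocol, AI peer review (written critiques) has minimal effect on dispersion, whereas exposure to top-rated exemplar papers reduces the interquartile range of estimates by 80-99% within converging measure families. Convergence occurs both through within-family tightening of estimates and through agents switching measure families entirely, but it reflects imitation rather than understanding. These findings have implications for the growing use of AI in automated policy evaluation and empirical research.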
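
The abstract reports dispersion across agents as an interquartile range (IQR) and its percentage reduction between protocol stages. The following is a minimal Python sketch of that arithmetic, not the authors' code. The function names `iqr` and `iqr_reduction_pct`, the quartile convention, and the estimate values are all hypothetical and chosen only for illustration.

```python
from statistics import quantiles


def iqr(estimates):
    """Interquartile range (Q3 - Q1) of a list of point estimates.

    Uses the "inclusive" quartile convention; the paper's convention
    is not stated in the abstract, so this is an assumption.
    """
    q1, _, q3 = quantiles(estimates, n=4, method="inclusive")
    return q3 - q1


def iqr_reduction_pct(before, after):
    """Percent reduction in IQR going from `before` to `after`."""
    base = iqr(before)
    if base == 0:
        raise ValueError("baseline IQR is zero; reduction is undefined")
    return 100.0 * (1.0 - iqr(after) / base)


# Hypothetical estimates of one measure from five agents, before and
# after a feedback stage. These numbers are made up, not from the paper.
stage_before = [-2.0, -1.0, 0.0, 1.0, 2.0]
stage_after = [-0.2, -0.1, 0.0, 0.1, 0.2]

print("IQR before:", iqr(stage_before))
print("IQR after:", iqr(stage_after))
print("Reduction (%):", iqr_reduction_pct(stage_before, stage_after))
```

With these made-up inputs the spread shrinks by a factor of ten, so the computed reduction is about 90%, which falls inside the 80-99% range quoted in the abstract only because the inputs were picked that way.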


Related papers