Fugu-MT Paper Translation (Abstract): CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Paper overview: CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

arxiv url: http://arxiv.org/abs/2605.26029v1
Date: Mon, 25 May 2026 16:57:06 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:20.52865
Title: CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
Title (reference translation): CausaLab: A Scalable Environment for Interactive Causal Discovery Aimed at AI Scientists
Authors: Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng,
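Abstract summary: We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. In each episode, the agent receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism.
Reference score (independently computed attention measure): 28.253879252786632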
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This gap motivates our exploration of different interaction strategies. Mixed observation-intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge $F_1$. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data can help mitigate this issue. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
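Abstract (reference translation): We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. In each episode, the agent receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis. In the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. In the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge $F_1$. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data can help mitigate this issue. CausaLab separates predictive success from causal understanding and exposes the limits of current LLM agents as experimental causal reasoners.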
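
Illustrative sketch (not from the paper): the abstract's distinction between task accuracy and all-edge $F_1$ can be made concrete with a toy example. The code below assumes a hypothetical 3-node linear SCM (X -> Y -> Z) with made-up coefficients, and assumes that an edge-level $F_1$ is computed over directed (parent, child) pairs. The paper's actual environment, domain-specific language, and metric definition are not given in the abstract and may differ.

```python
import random

# Hypothetical 3-node linear SCM: X -> Y -> Z.
# Variable names and coefficients are illustrative assumptions only.
def sample(rng, do=None):
    """Draw one sample; `do` maps a variable name to a forced value."""
    do = do or {}
    v = {}
    v["X"] = do.get("X", rng.gauss(0.0, 1.0))
    v["Y"] = do.get("Y", 2.0 * v["X"] + rng.gauss(0.0, 0.1))
    v["Z"] = do.get("Z", -1.0 * v["Y"] + rng.gauss(0.0, 0.1))
    return v

def edge_f1(predicted, truth):
    """F1 over directed edges, each edge given as a (parent, child) tuple."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

rng = random.Random(0)
true_edges = {("X", "Y"), ("Y", "Z")}

# A hypothesis with a direct X -> Z edge can still fit observational data
# (Z is roughly -2 * X), yet it mis-states the mechanism.
hypothesis = {("X", "Y"), ("X", "Z")}
score = edge_f1(hypothesis, true_edges)

# Forcing Y cuts its dependence on X, so Z tracks the forced Y instead.
forced = sample(rng, do={"Y": 5.0})
print(score, forced)
```

In this toy setting the hypothesis could predict Z from X on observational data alone, while a single intervention on Y would reveal that Z responds to Y rather than directly to X. This is one way to read the gap the abstract reports between prediction and mechanism recovery.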

