Fugu-MT Paper Translation (Abstract): ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

Paper abstract: ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

arxiv url: http://arxiv.org/abs/2605.08197v1
Date: Tue, 05 May 2026 19:53:27 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.453063
Title: ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions
Title (reference translation): ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions
Authors: Serafim Batzoglou
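Abstract summary: ReplaySCM is a 1,300-item benchmark for causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent, fully observed, acyclic structural causal model (SCM). ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings.
Reference score (independently calculated attention score): 0.0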
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300-item benchmark for executable causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent, fully observed, acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness. Frontier LLMs infer parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. We also evaluate a matched support-audit ladder of Original, Extra Worlds, and Counterexample Audit (CEx), which raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0; under the audited searches, no discovered semantic alternative remains consistent with the training worlds. The Ordered/Hidden-order gap persists under this stronger evidence. ReplaySCM complements answer-level causal reasoning and graph-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence, without claiming unique identification of the latent SCM.
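Abstract (reference translation): Most causal benchmarks for language models score local answers or graph structure. This paper introduces ReplaySCM, a 1,300-item benchmark for causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent, fully observed, acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness. Frontier LLMs infer parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. We also evaluate a matched support-audit ladder of Original, Extra Worlds, and Counterexample Audit (CEx), which raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0. The Ordered/Hidden-order gap persists under this stronger evidence. ReplaySCM complements answer-level causal reasoning and graph-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence.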
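The parse, acyclicity check, and replay pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the tuple-based mini-DSL, the world layout (`"do"` for interventions, `"values"` for the full observed assignment), the function names, and the exact-match-per-world scoring rule are all hypothetical.

```python
from graphlib import TopologicalSorter  # raises graphlib.CycleError on cycles

# Hypothetical mini-DSL (not the paper's DSL), written as nested tuples:
# ("var", name) | ("const", 0 or 1) | ("not", e) | ("and", e1, e2) | ("or", e1, e2)

def parents(expr):
    """Return the set of variable names that an expression reads."""
    if expr[0] == "var":
        return {expr[1]}
    if expr[0] == "const":
        return set()
    return set().union(*(parents(sub) for sub in expr[1:]))

def evaluate(expr, values):
    """Evaluate an expression over 0/1 values; reject unknown operators."""
    op = expr[0]
    if op == "var":
        return values[expr[1]]
    if op == "const":
        return expr[1]
    if op == "not":
        return 1 - evaluate(expr[1], values)
    if op == "and":
        return evaluate(expr[1], values) & evaluate(expr[2], values)
    if op == "or":
        return evaluate(expr[1], values) | evaluate(expr[2], values)
    raise ValueError(f"illegal operator: {op!r}")

def evaluation_order(mechanisms):
    """Order variables parents-first; a cyclic map raises CycleError."""
    graph = {var: parents(expr) for var, expr in mechanisms.items()}
    return list(TopologicalSorter(graph).static_order())

def replay(mechanisms, world):
    """Recompute one world from its root values and its interventions."""
    # Variables without a mechanism are treated as roots and keep observed values.
    values = {v: x for v, x in world["values"].items() if v not in mechanisms}
    values.update(world["do"])  # intervened variables are forced
    for var in evaluation_order(mechanisms):
        if var in mechanisms and var not in world["do"]:
            values[var] = evaluate(mechanisms[var], values)
    return values

def replay_accuracy(mechanisms, worlds):
    """Fraction of worlds reproduced exactly (assumed scoring rule)."""
    return sum(replay(mechanisms, w) == w["values"] for w in worlds) / len(worlds)

A, B, C = ("var", "A"), ("var", "B"), ("var", "C")
# Worlds generated by the toy latent model C = A and B, D = not C.
worlds = [
    {"do": {}, "values": {"A": 1, "B": 1, "C": 1, "D": 0}},
    {"do": {}, "values": {"A": 1, "B": 0, "C": 0, "D": 1}},
    {"do": {"C": 1}, "values": {"A": 0, "B": 0, "C": 1, "D": 0}},
    {"do": {}, "values": {"A": 0, "B": 0, "C": 0, "D": 1}},
]
# De Morgan form of "A and B": different string, same behavior.
equivalent = {"C": ("not", ("or", ("not", A), ("not", B))), "D": ("not", C)}
wrong = {"C": ("or", A, B), "D": ("not", C)}

score_equivalent = replay_accuracy(equivalent, worlds)  # 1.0
score_wrong = replay_accuracy(wrong, worlds)            # 0.75
```

In this toy setup the De Morgan rewrite of `A and B` replays every world correctly, which mirrors the abstract's point that scoring depends on behavior and not on formula strings, while the `or` variant fails the one world where `A` and `B` differ.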

Related papers