Fugu-MT Paper Translation (Abstract): Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

Paper overview: Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

arxiv url: http://arxiv.org/abs/2601.15479v1
Date: Wed, 21 Jan 2026 21:29:46 GMT
Status: Translation complete
Last updated in system: 2026-01-23 21:37:20.420469
Title: Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts
Title (reference translation): Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts
Authors: Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty
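Abstract summary: Large language models (LLMs) used in high-stakes fields such as biomedicine need to be able to reason about cause and effect. The benchmark uses 12 diverse datasets to evaluate two core skills: Causal Detection (identifying whether a text contains a causal link) and Causal Extraction (extracting the exact cause and effect phrases).
Reference score (independently computed attention score): 0.434964016971127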
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying whether a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, achieved a mean score of only 49.57% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. Code available here: https://github.com/sydneyanuyah/CausalDiscovery
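Abstract (reference translation): To deploy large language models (LLMs) safely in high-stakes fields such as biomedicine, the models must be able to reason about cause and effect. We examined this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark uses 12 diverse datasets to evaluate two core skills: 1) Causal Detection (identifying whether a text contains a causal link) and 2) Causal Extraction (extracting the exact cause and effect phrases). We tried a range of prompting methods, from simple instructions (zero-shot) to more complex strategies such as Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results revealed major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, obtained a mean score of only 49.57% ($C_{detect}$), while the best model for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance dropped sharply on more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and we release all data, code, and prompts to encourage further research. Code available here: https://github.com/sydneyanuyah/CausalDiscovery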
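
The abstract reports inter-annotator agreement as $\kappa \ge 0.758$ but does not say which kappa variant was used. As a minimal sketch, assuming Cohen's kappa for two annotators assigning one label per text, the statistic could be computed as follows. The function name `cohen_kappa` and the example labels are hypothetical illustrations and are not taken from the paper or its repository.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: two annotators, one categorical label per item.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label counts.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations, not data from the paper.
annotator_1 = ["causal", "causal", "non-causal", "causal",
               "non-causal", "non-causal", "causal", "non-causal"]
annotator_2 = ["causal", "causal", "non-causal", "non-causal",
               "non-causal", "non-causal", "causal", "non-causal"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the annotators agree on 7 of 8 items (observed agreement 0.875) and the chance agreement is 0.5, so kappa is 0.75, slightly below the threshold the abstract reports.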

Related papers