Fugu-MT Paper Translation (Abstract): LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories

Paper abstract: LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories

arxiv url: http://arxiv.org/abs/2603.11987v1
Date: Thu, 12 Mar 2026 14:38:13 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.145301
Title: LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Title (reference translation): LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Authors: Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge, Jiajun Li, Sirui Han, Shanghang Zhang
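Abstract summary: Multimodal large language model (MLLM) agents are evolving from lab assistants into self-driving lab operators. LABSHIELD is a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework.
Reference score (independently computed attention score): 41.392364324753224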
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and Semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent necessity for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory contexts. The full dataset will be released soon.
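Abstract (reference translation): Artificial intelligence is increasingly accelerating scientific automation, and multimodal large language model (MLLM) agents are evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment can make planning errors or misinterpreted risks irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. The results reveal a systematic gap between general-domain MCQ accuracy and semi-open QA safety performance: models show an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent need for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory settings. The full dataset will be released soon.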

Related papers