Fugu-MT Paper Translation (Abstract): Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

Paper summary: Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

arxiv url: http://arxiv.org/abs/2601.05529v2
Date: Thu, 15 Jan 2026 05:09:03 GMT
Status: Translation complete
Last updated in system: 2026-01-16 13:33:41.23454
Title: Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making
Title (reference translation): Safety Not Found (404): Hidden Risks of LLM-Based Robot Decision Making
Authors: Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim
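Abstract summary: A single mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic.
Reference score (independently computed attention score): 12.400383981686801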
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
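Abstract (reference translation): A single mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks use ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.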
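
The abstract's point about a 1% failure rate can be made concrete with a little arithmetic. The sketch below is an illustration only and is not taken from the paper: it assumes every execution fails independently with the same probability, and the function name `prob_at_least_one_failure` is hypothetical.

```python
def prob_at_least_one_failure(per_run_failure_rate: float, runs: int) -> float:
    """Probability of one or more failures across `runs` executions.

    Assumption (not from the paper): runs are independent and share the
    same per-run failure rate, so P(no failure) = (1 - rate) ** runs.
    """
    if not 0.0 <= per_run_failure_rate <= 1.0:
        raise ValueError("per_run_failure_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


if __name__ == "__main__":
    # A 1% per-run failure rate corresponds to the "99% accuracy" case.
    for runs in (1, 10, 100, 1000):
        p = prob_at_least_one_failure(0.01, runs)
        print(f"{runs:>5} runs: P(at least one failure) = {p:.3f}")
```

Under these assumptions, a system that is 99% accurate per execution has roughly a 63% chance of at least one failure over 100 executions, which is one way to read the abstract's warning that 99% accuracy is misleading in robotics.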

Related papers