Fugu-MT Paper Translation (Abstract): FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

Paper summary: FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes

arxiv url: http://arxiv.org/abs/2506.03278v1
Date: Tue, 03 Jun 2025 18:05:10 GMT
Status: Translation complete
Last updated in system: 2025-06-05 21:20:13.995685
Title: FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes
Title (reference translation): FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes
Authors: Christodoulos Constantinides, Dhaval Patel, Shuxin Lin, Claudio Guerrero, Sunil Dagajirao Patil, Jayant Kalagnanam
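Abstract summary: This paper introduces FailureSensorIQ, a Multi-Choice Question-Answering (MCQA) benchmarking system. Unlike traditional QA benchmarks, the system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets.
Reference score (attention level computed independently by this site): 7.788259584005182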
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason about and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven, using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through different lenses: Perturbation-Uncertainty-Complexity analysis, an Expert Evaluation study, Asset-Specific Knowledge Gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals that performance drops significantly and is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) the FailureSensorIQ benchmark and a Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
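Abstract (reference translation): We introduce FailureSensorIQ, a Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of large language models (LLMs) to reason about and understand complex, domain-specific scenarios. Unlike traditional QA benchmarks, the system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift in which modeling decisions are not only data-driven, using statistical tools such as correlation analysis and significance tests, but also domain-driven by specialized LLMs that can reason about the key contributors and useful patterns that feature engineering can capture. We evaluate the industrial knowledge of more than ten LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ using perturbation-uncertainty-complexity analysis, expert evaluation, asset-specific knowledge gap analysis, and a ReAct agent that uses external knowledge bases. Closed-source models with strong reasoning capabilities approach expert-level performance, but the comprehensive benchmark shows that performance drops markedly and is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also examine, in a real-world case study, how LLMs can drive modeling decisions on three different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) the FailureSensorIQ benchmark and a Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.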
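
The abstract reports that model performance is fragile to perturbations but does not say how the perturbations are constructed. As a rough illustration only, the sketch below assumes one common kind of MCQA perturbation, shuffling the answer options, and counts an item as robustly answered only if the model picks the correct option under every shuffled ordering. The function names, the `model` callable signature, and the toy question are all hypothetical and are not taken from the paper or from the FailureSensorIQ repository.

```python
import random
from typing import Callable, List, Sequence, Tuple

# An MCQA item: (question, options, index of the correct option).
Item = Tuple[str, Sequence[str], int]
# A model takes a question and a list of options and returns a chosen index.
Model = Callable[[str, List[str]], int]


def shuffle_options(options: Sequence[str], answer_idx: int,
                    rng: random.Random) -> Tuple[List[str], int]:
    """Return the options in random order and the new index of the answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)


def perturbed_accuracy(items: Sequence[Item], model: Model,
                       n_perturb: int = 5, seed: int = 0) -> float:
    """Fraction of items answered correctly under every shuffled ordering."""
    rng = random.Random(seed)
    robust = 0
    for question, options, answer_idx in items:
        all_correct = True
        for _ in range(n_perturb):
            opts, idx = shuffle_options(options, answer_idx, rng)
            if model(question, opts) != idx:
                all_correct = False
                break
        robust += all_correct
    return robust / len(items) if items else 0.0


# Hypothetical toy item, invented for illustration (not from the dataset).
toy_items: List[Item] = [
    ("Which sensor reading is most relevant for detecting bearing wear "
     "in an electric motor?",
     ["vibration", "ambient humidity", "paint thickness", "cable color"],
     0),
]


def oracle_model(question: str, options: List[str]) -> int:
    """Stand-in model that recognizes the correct option by its content."""
    return options.index("vibration")


def first_option_model(question: str, options: List[str]) -> int:
    """Stand-in position-biased model that always picks the first option."""
    return 0


oracle_score = perturbed_accuracy(toy_items, oracle_model)
biased_score = perturbed_accuracy(toy_items, first_option_model)
```

A content-based stand-in such as `oracle_model` is unaffected by reordering, while a position-biased stand-in such as `first_option_model` is correct only when the answer happens to land in the first slot, which is the kind of fragility an option-shuffling check is meant to expose.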


Related papers