Fugu-MT Paper Translation (Abstract): Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

Paper overview: Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

arxiv url: http://arxiv.org/abs/2605.15734v1
Date: Fri, 15 May 2026 08:43:26 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.223282
Title: Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments
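Title (reference translation): A Psychometric Framework for Validating the Reliability of User State Classification by LLMs in Operational Environments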
Authors: Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska
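Abstract summary: This paper empirically tests the assumption that the metrics used to assess user states are stable and interpretable at the level of individual scores. The analyses cover both individual-score reliability and aggregated reliability, which makes it possible to identify the metrics that are useful for real-time adaptation. The main contribution of the work is a replicable evaluation framework that enables measurable evaluation of metric applicability.
Reference score (independently computed attention score): 0.0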
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time.
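Abstract (reference translation): Using large language models to assess user states in conversational and adaptive systems rests on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper tests that assumption empirically, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. The study used replication evaluation procedures to assess the repeatability of a broad set of metrics across three bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). The analyses cover both individual-score reliability and aggregated reliability, which separates metrics potentially useful for real-time adaptation from those that retain value only in aggregated analyses. The results show that metric reliability cannot be treated as a default property in interpretive domains. A lack of stability at the level of individual scores rules out reading such scores as indicators of user state in real-time adaptive systems, even when the same metrics are stable after aggregation. At the same time, the study indicates that individually unstable metrics can remain analytically useful in post-hoc studies that identify the rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. Besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), the main contribution of the work is a replicable evaluation framework that enables measurable evaluation of metric applicability. This approach supports more responsible AI design of adaptive systems, in which interpreting results requires explicit validation of reliability and monitoring for violations over time.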
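
The abstract's distinction between individual-score reliability and aggregated reliability can be illustrated with a small sketch. It assumes a one-way random-effects intraclass correlation (ICC), a common psychometric choice; the abstract does not say which statistic or acceptance criteria the authors used. The function name `icc_oneway` and the sample scores are hypothetical.

```python
from typing import Sequence


def icc_oneway(scores: Sequence[Sequence[float]]) -> tuple[float, float]:
    """Return (ICC(1), ICC(k)) for a one-way random-effects design.

    scores[i][j] is the j-th replicated score a model gave to item i
    (for example, the same recording scored k times on one metric).
    ICC(1) estimates the reliability of a single score; ICC(k) estimates
    the reliability of the mean of the k replications.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 items, each with the same k >= 2 replications")

    row_means = [sum(row) / k for row in scores]
    grand_mean = sum(row_means) / n

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    if ms_between == 0:
        raise ValueError("items do not differ; ICC is undefined")

    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average


# Hypothetical data: 3 items, each scored twice by the same model.
example_scores = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
icc_single, icc_average = icc_oneway(example_scores)
print(f"single-score ICC: {icc_single:.3f}, averaged ICC: {icc_average:.3f}")
```

For this toy data the single-score value is about 0.88 and the averaged value about 0.94. Averaging over replications always raises the estimate when the single-score value is positive, which is one way a metric can look stable after aggregation while being less dependable for any one real-time score.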

Related papers