Fugu-MT Paper Translation (Abstract): Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale

Paper overview: Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale

arxiv url: http://arxiv.org/abs/2505.13480v1
Date: Sun, 11 May 2025 23:55:27 GMT
Status: Translation complete
Last updated in system: 2025-05-21 14:49:52.260189
Title: Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale
Title (reference translation): Evaluation of Reasoning LLMs for Suicide Screening Using the Columbia-Suicide Severity Rating Scale
Authors: Avinash Patil, Siru Tao, Amardeep Gedhu
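Abstract summary: This study evaluates the ability of large language models to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). It assesses the zero-shot performance of six models, including Claude, GPT, Mistral, and LLaMA, in classifying posts on a 7-point severity scale (Levels 0-6). Claude and GPT align closely with human annotations, while Mistral achieves the lowest ordinal prediction error.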
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Suicide prevention remains a critical public health challenge. While online platforms such as Reddit's r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm in which individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models, including Claude, GPT, Mistral, and LLaMA, in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at https://github.com/av9ash/llm_cssrs_code.
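Abstract (reference translation): Suicide prevention remains an important public health challenge. Online platforms such as Reddit's r/SuicideWatch have provided spaces for individuals to express suicidal thoughts and seek community support, but the emergence of large language models (LLMs) introduces a new paradigm in which individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the ability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). It assesses the zero-shot performance of six models, including Claude, GPT, Mistral, and LLaMA, in classifying posts on a 7-point severity scale (Levels 0-6). The results show that Claude and GPT align closely with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. The authors further analyze confusion patterns, sources of misclassification, and ethical considerations, emphasizing the importance of human oversight, transparency, and cautious deployment. The full code and supplementary materials are available at https://github.com/av9ash/llm_cssrs_code.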
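The abstract refers to an "ordinal prediction error" and to misclassifications that fall between adjacent severity levels. The exact metrics used in the paper are not given on this page, so the following is only a minimal sketch under stated assumptions: ordinal error is taken to be the mean absolute difference between human and model levels, and an "adjacent" error is one that is off by exactly one level. The function names and the toy labels are hypothetical and are not results from the paper.

```python
from typing import Sequence


def ordinal_mae(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean absolute difference between true and predicted severity levels.

    Assumption: this stands in for the "ordinal prediction error" mentioned
    in the abstract; the paper may define the metric differently.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def adjacent_error_share(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of misclassified items whose prediction is off by exactly one level."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    errors = [abs(t - p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0  # no misclassifications at all
    return sum(1 for e in errors if e == 1) / len(errors)


# Hypothetical toy labels on the 0-6 scale (illustrative only, not paper data).
human = [0, 2, 3, 5, 6, 1]
model = [0, 3, 3, 4, 6, 3]

print(f"ordinal MAE: {ordinal_mae(human, model):.3f}")
print(f"share of errors between adjacent levels: {adjacent_error_share(human, model):.3f}")
```

A high adjacent-error share combined with a low mean absolute error is one way to read the abstract's claim of "ordinal sensitivity": when a model is wrong, it tends to be wrong by a single level rather than by several.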

Related papers