Fugu-MT Paper Translation (Abstract): Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs

Paper overview: Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs

arxiv url: http://arxiv.org/abs/2606.08483v1
Date: Sun, 07 Jun 2026 07:01:15 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.138797
Title: Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs
Title (reference translation): Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs
Authors: Rahul Gorijavolu, Kaushik Madapati, Pritika Vig, Rawan Abulibdeh, Nikhil Jaiswal, Mahri Kadyrova, Zeamanuel Hailu Tesfaye, Charles Senteio, Paula Maurutto, Leo Anthony Celi
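Abstract summary: The study examines response variation and sycophancy in consumer-facing health LLMs under conditions resembling ordinary patient use. Validated instruments, including the Vaccination Attitudes Examination scale and reproductive attitudes scales, were adapted into multi-turn prompts. Results: Factual prompts produced stable responses that masked sycophancy emerging over multi-turn conversation.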
Reference score (attention score computed independently by this site): 2.526741713337939
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Background: Consumer-facing large language models are now a common source of health information, and they interpret and personalize responses rather than retrieve them. Whether their responses vary across users is a clinical, equity, and governance question, sharpened by evidence that sycophantic responses can alter judgment and increase trust. Objective: To evaluate response variation and sycophancy in consumer-facing health LLMs under conditions resembling ordinary patient use. Methods: We constructed simulated user profiles differing in geography, browsing context, expressed beliefs, and social determinants of health, drawing on literature linking social context to health attitudes. We adapted validated instruments, including the Vaccination Attitudes Examination scale and reproductive attitudes scales, into multi-turn prompts designed to elicit clinically meaningful variation across users. Results: The evaluation encountered five linked barriers. Factual prompts produced stable responses that masked sycophancy emerging over multi-turn conversation. Browser-based interfaces did not disclose which signals influence outputs and could not be reset to a clean baseline. Large-scale testing was restricted by terms of service, rate limits, and bot detection. Accuracy-based criteria could not capture tone, framing, or omission, and LLM-as-judge methods risked shared alignment bias. Models changed without traceable version identifiers, preventing reliable replication. Conclusions: No reliable independent evaluation framework yet exists for examining how consumer-facing health LLMs behave in ordinary use. Oversight requires disclosure of personalization signals, stable version identifiers, researcher safe harbor programs, and post-deployment monitoring of health-related outputs.
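Abstract (reference translation): Background: Consumer-facing large language models are now a common source of health information, and they interpret and personalize their responses rather than simply retrieving them. Whether their responses differ across users is a clinical, equity, and governance question, made sharper by evidence that sycophantic responses can alter judgment and increase trust. Objective: To evaluate response variation and sycophancy in consumer-facing health LLMs under conditions resembling ordinary patient use. Methods: We constructed simulated user profiles that differ in geography, browsing context, expressed beliefs, and social determinants of health, drawing on literature that links social context to health attitudes. We adapted validated instruments, including the Vaccination Attitudes Examination scale and reproductive attitudes scales, into multi-turn prompts. Results: The evaluation encountered five linked barriers. Factual prompts produced stable responses that masked the sycophancy that emerges over multi-turn conversation. Browser-based interfaces did not disclose which signals influence outputs and could not be reset to a clean baseline. Large-scale testing was restricted by terms of service, rate limits, and bot detection. Accuracy-based criteria could not capture tone, framing, or omission, and LLM-as-judge methods carried a risk of shared alignment bias. Models changed without traceable version identifiers, which prevented reliable replication. Conclusions: No reliable independent evaluation framework yet exists for examining how consumer-facing health LLMs behave in ordinary use. Oversight requires disclosure of personalization signals, stable version identifiers, researcher safe harbor programs, and post-deployment monitoring of health-related outputs.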
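The Methods sentence about simulated user profiles and multi-turn prompts can be pictured with a minimal sketch. This is not the authors' code: the attribute levels, the placeholder instrument items, and the names `UserProfile`, `build_profiles`, and `build_turns` are all hypothetical, chosen only to show how profiles differing along the four dimensions named in the abstract could be crossed and turned into a multi-turn script.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class UserProfile:
    # The four dimensions come from the abstract; the field names are assumptions.
    geography: str
    browsing_context: str
    expressed_belief: str
    social_determinant: str


# Hypothetical attribute levels, two per dimension, for illustration only.
LEVELS = {
    "geography": ("urban", "rural"),
    "browsing_context": ("health news sites", "alternative medicine forums"),
    "expressed_belief": ("trust vaccines", "doubt vaccines"),
    "social_determinant": ("insured", "uninsured"),
}

# Placeholders: the real instrument items are not given in the abstract.
ITEMS = ("<instrument item 1>", "<instrument item 2>")


def build_profiles():
    """Cross every attribute level to get a full factorial set of profiles."""
    return [UserProfile(*combo) for combo in product(*LEVELS.values())]


def build_turns(profile, items=ITEMS):
    """Return the user side of a conversation: one opening turn that states
    the profile, then one turn per instrument item."""
    opening = (
        f"I live in a {profile.geography} area, I mostly read "
        f"{profile.browsing_context}, I am {profile.social_determinant}, "
        f"and I would say that I {profile.expressed_belief}."
    )
    follow_ups = [f"What do you think about this statement: {item}" for item in items]
    return [opening] + follow_ups


if __name__ == "__main__":
    profiles = build_profiles()
    print(len(profiles), "profiles")
    for turn in build_turns(profiles[0]):
        print("-", turn)
```

With two levels per dimension the grid has 16 profiles. Holding the instrument items fixed while varying only the opening turn is one way to attribute response differences to the profile, under the assumption that the interface can be reset between conversations, which the abstract reports was not possible for browser-based interfaces.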

Related papers

Green Shielding: A User-Centric Approach Towards Trustworthy AI [19.485991712624095]
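Green Shielding is a user-centric agenda for building evidence-based deployment guidance. It is instantiated for medical diagnosis with HealthCareMagic-Diagnosis. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.
Paper reference translation (metadata) (2026-04-27T17:04:17Z)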
Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness [49.2667937337333]
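We stress-test this assumption for detecting incomplete medical responses to patients. We evaluate three granularities (General-Likert, Analytical-Rubric, Dynamic-Checklist) and three backbone models across two clinical diagnosis datasets.
Paper reference translation (metadata) (2026-03-26T19:01:55Z)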
Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
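This paper proposes the first benchmark for evaluating confidence in multi-turn interactions in realistic medical consultations. The benchmark integrates three types of medical data for diagnosis. The paper also introduces MedConf, an evidence-grounded linguistic self-assessment framework.
Paper reference translation (metadata) (2026-01-22T04:51:39Z)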
Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
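MedGaze-Bench is the first benchmark that uses clinician gaze as a cognitive cursor to evaluate intent understanding in surgery, emergency simulation, and diagnostic interpretation. The benchmark addresses three fundamental challenges: the visual homogeneity of anatomical structures, strict temporal and causal dependencies in clinical practice, and implicit adherence to safety protocols.
Paper reference translation (metadata) (2026-01-11T02:20:40Z)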
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
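Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases and to identify and evaluate strategies that trigger unsafe behavior. Our framework provides an evolvable, scalable, and reliable safeguard for next-generation medical AI.
Paper reference translation (metadata) (2025-07-30T08:44:22Z)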
Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems [4.031787614742573]
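This study systematically evaluates demographic bias within medical RAG pipelines across multiple QA benchmarks. We implement and compare several bias mitigation strategies, including chain-of-thought reasoning, counterfactual filtering, adaptive prompt refinement, and majority-vote aggregation, to address the identified biases.
Paper reference translation (metadata) (2025-03-19T17:36:35Z)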
Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering [51.26412822853409]
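This paper proposes a personalized federated learning (pFL) method for medical visual question answering (VQA) models. The method introduces learnable prompts into a Transformer architecture so that it trains efficiently on diverse medical datasets without massive computational cost.
Paper reference translation (metadata) (2024-10-23T00:31:17Z)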
Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
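Susceptibility to misinformation describes the degree of belief in unverifiable claims, a quantity that is not directly observable. Existing susceptibility studies rely heavily on self-reported beliefs. This paper proposes a computational approach to modeling users' latent susceptibility levels.
Paper reference translation (metadata) (2023-11-16T07:22:56Z)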
Explainable Depression Symptom Detection in Social Media [2.677715367737641]
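This paper uses transformer architectures to detect and explain the appearance of depressive symptom markers in users' writings. Our natural language explanations allow clinicians to interpret the models' decisions on the basis of validated symptoms.
Paper reference translation (metadata) (2023-10-20T17:05:27Z)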

The related papers list is generated automatically from the titles and abstracts of papers on this site.

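This page shows information for the specified paper.
The operator of this site does not guarantee the quality of the site (including all information and translations) and accepts no responsibility for any outcome arising from its use.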