Fugu-MT Paper Translation (Abstract): Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models

Paper summary: Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2508.17675v3
Date: Sat, 06 Sep 2025 18:23:48 GMT
Status: Translation complete
System update date: 2025-09-09 12:02:46.89028
Title: Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models
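Title (reference translation): Towards Synthesizing Normative Data for Cognitive Assessment Using Generative Multimodal Large Language Models
Authors: Victoria Yan, Honor Chotkowski, Fengran Wang, Xinhui Li, Carl Yang, Jiaying Lu, Runze Yan, Xiao Hu, Alex Fedorov
Abstract summary: Developing new cognitive tests based on novel image stimuli is challenging because normative data are not readily available. Recent advances in generative multimodal large language models (MLLMs) offer a new approach to generating synthetic normative data from existing cognitive test images.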
Reference score (independently computed attention score): 15.287990843387382
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generate synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the "Cookie Theft" picture description task. Two distinct prompting strategies (naive prompts with basic instructions and advanced prompts enriched with contextual guidance) were evaluated. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. Superior models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.
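Abstract (reference translation): Cognitive assessments require normative data as an essential benchmark for evaluating individual performance. Developing new cognitive tests based on novel image stimuli is therefore difficult, because normative data are not readily available. Conventional data collection methods are costly, time-consuming, and infrequently updated, which limits their practical utility. Recent advances in generative multimodal large language models (MLLMs) offer a new approach to generating synthetic normative data from existing cognitive test images. This study examined the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments such as the "Cookie Theft" picture description task. Two distinct prompting strategies were evaluated: naive prompts with basic instructions, and advanced prompts enriched with contextual guidance. Responses were analyzed using embeddings to assess how well they distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Compared with naive prompts, advanced prompting strategies produced synthetic responses that distinguished diagnostic groups more effectively and captured demographic diversity. Superior models generated responses with higher realism and diversity. BERTScore emerged as the most reliable metric for assessing contextual similarity, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. The study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.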
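
Note on the metrics: the abstract reports that BLEU was less effective than BERTScore for evaluating creative outputs. One plausible reason is that BLEU and ROUGE count shared surface tokens, so a faithful paraphrase can score near zero. The sketch below illustrates this with a simplified clipped unigram overlap (the idea underlying ROUGE-1, not the official BLEU or ROUGE implementation and not the authors' evaluation code). The function names and the example sentences are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real metrics
    # typically use more careful tokenization).
    return text.lower().split()


def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) of clipped unigram overlap."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical picture descriptions, invented for illustration only.
reference = "the boy is stealing cookies from the jar"
verbatim = "the boy is stealing cookies from the jar"
paraphrase = "a child takes biscuits out of a container"

print(unigram_overlap(verbatim, reference))
print(unigram_overlap(paraphrase, reference))
```

The paraphrase describes the same scene but shares no tokens with the reference, so every overlap score is zero. An embedding-based metric such as BERTScore compares contextual token representations instead of exact strings, which is consistent with the abstract's finding that it was more reliable for this task.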

Related papers