Fugu-MT Paper Translation (Abstract): A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

Paper overview: A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

arxiv url: http://arxiv.org/abs/2605.25540v1
Date: Mon, 25 May 2026 07:57:49 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:19.451182
Title: A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning
Title (reference translation): A multimodal framework for dementia detection via linguistic and acoustic representation learning
Authors: Loukas Ilias, Dimitris Askounis
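Abstract summary: Alzheimer's disease is the leading cause of dementia and affects memory, reasoning, communication, and daily functioning. Recent studies have shown that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. This paper proposes a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner.
Reference score (attention score computed independently by this site): 10.559333552210434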
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 dataset demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
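Abstract (reference translation): Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important because timely intervention can help slow cognitive decline and improve patient care. Recent studies have shown that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. This work proposes a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling aggregates the frame-level acoustic embeddings. For the textual modality, transcripts are encoded with a pre-trained BERT model, and the [CLS] token representation serves as the linguistic embedding. The acoustic and textual representations are then combined by an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, a MINE objective is introduced to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments on the publicly available ADReSS Challenge and PROCESS-2 datasets demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.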
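
Two steps named in the abstract, attentive statistics pooling and the MINE objective, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simplified attention scorer (a dot product between tanh(frame) and a hypothetical learned vector `score_weights`) and the Donsker-Varadhan form of the mutual-information bound, which the abstract does not specify. The function names, the `critic` callable, and the cyclic shift used in place of random shuffling are all illustrative assumptions. In the described system the inputs would be HuBERT frame embeddings and BERT [CLS] embeddings; toy vectors stand in for them here.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames: Sequence[Vector], score_weights: Vector) -> Vector:
    """Pool frame-level embeddings into [weighted mean ; weighted std].

    `score_weights` is a hypothetical learned vector (an assumption): each
    frame's attention score is the dot product of tanh(frame) with it.
    """
    scores = [
        sum(w * math.tanh(x) for w, x in zip(score_weights, frame))
        for frame in frames
    ]
    alphas = softmax(scores)  # one attention weight per frame, summing to 1
    dim = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    second = [sum(a * f[d] ** 2 for a, f in zip(alphas, frames)) for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(m2 - m * m, 0.0)) for m2, m in zip(second, mean)]
    return mean + std


def dv_lower_bound(
    audio: Sequence[Vector],
    text: Sequence[Vector],
    critic: Callable[[Vector, Vector], float],
) -> float:
    """Donsker-Varadhan bound: E_joint[T] - log E_marginals[exp(T)].

    Joint samples are the aligned pairs (audio[i], text[i]). Samples from the
    product of marginals are approximated by pairing audio[i] with
    text[(i + 1) % n], a fixed cyclic shift chosen so the example is
    deterministic (an assumption; random shuffling is the usual choice).
    """
    n = len(audio)
    joint_term = sum(critic(a, t) for a, t in zip(audio, text)) / n
    shifted = [critic(audio[i], text[(i + 1) % n]) for i in range(n)]
    peak = max(shifted)  # stable log-mean-exp
    log_mean_exp = peak + math.log(sum(math.exp(s - peak) for s in shifted) / n)
    return joint_term - log_mean_exp


def dot(a: Vector, b: Vector) -> float:
    """Toy critic: a plain dot product (a trained network would go here)."""
    return sum(x * y for x, y in zip(a, b))


# Zero score weights give uniform attention, so pooling reduces to mean and std.
pooled = attentive_stats_pooling([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
bound = dv_lower_bound([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], dot)
```

During training, the critic's parameters and both encoders would be updated to increase this bound alongside the classification loss. The sketch only evaluates the bound for a fixed critic.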

Related papers

Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs [85.69785384599827]
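Human-object interaction (HOI) detection aims to localize human-object pairs and their interactions. Existing methods operate under a closed-world assumption and treat the task as a classification problem over a small, predefined verb set. This paper proposes GRASP-HO, a novel generative reasoning and steerable perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem.
Paper reference translation (metadata) (2025-12-19T14:41:50Z)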
National Institute on Aging PREPARE Challenge: Early Detection of Cognitive Impairment Using Speech -- The SpeechCARE Solution [1.0486773259892048]
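Alzheimer's disease and related dementias affect one in five adults over 60, yet more than half of the people with cognitive decline remain undiagnosed. SpeechCARE is a multimodal speech processing pipeline that captures subtle speech-related cues associated with cognitive impairment. Its robust preprocessing includes automatic transcription, large language model (LLM)-based anomaly detection, and task identification.
Paper reference translation (metadata) (2025-11-11T11:39:20Z)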
Linguistic and Audio Embedding-Based Machine Learning for Alzheimer's Dementia and Mild Cognitive Impairment Detection: Insights from the PROCESS Challenge [0.0]
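Speech, which encompasses both acoustic and linguistic dimensions, offers a promising non-invasive biomarker for cognitive decline. This paper proposes a machine learning framework for the PROCESS Challenge that combines audio embeddings and linguistic features from spontaneous speech.
Paper reference translation (metadata) (2025-10-02T06:54:55Z)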
CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection [1.6418612334727776]
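This paper introduces CogniAlign, a multimodal architecture for Alzheimer's detection. It integrates the audio and text modalities, two non-intrusive sources of information. It achieves an accuracy of 87.35% in a Leave-One-Subject-Out setup and 90.36% in cross-validation.
Paper reference translation (metadata) (2025-06-02T17:17:01Z)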
Dementia Insights: A Context-Based MultiModal Approach [0.3749861135832073]
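Early detection is essential for timely interventions that may slow disease progression. Large pre-trained models (LPMs) for text and audio have shown promise in identifying cognitive impairment. This study proposes a context-based multimodal method that integrates text and audio data using the best-performing LPMs.
Paper reference translation (metadata) (2025-03-03T06:46:26Z)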
Devising a Set of Compact and Explainable Spoken Language Feature for Screening Alzheimer's Disease [52.46922921214341]
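Alzheimer's disease (AD) has become one of the most significant health challenges in an aging society. We devise an explainable and effective feature set that leverages the visual capabilities of a large language model (LLM) and the TF-IDF model. The new features can be explained and interpreted step by step, which enhances the interpretability of automatic AD screening.
Paper reference translation (metadata) (2024-11-28T05:23:22Z)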
Leveraging Pretrained Representations with Task-related Keywords for Alzheimer's Disease Detection [69.53626024091076]
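Alzheimer's disease (AD) is particularly prominent in older adults. Recent advances in pre-trained models motivate a shift in AD detection modeling from low-level features to high-level representations. This paper presents several efficient methods for extracting better AD-related cues from high-level acoustic and linguistic features.
Paper reference translation (metadata) (2023-03-14T16:03:28Z)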
Decoding speech perception from non-invasive brain recordings [48.46819575538446]
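We introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from non-invasive recordings. From 3 seconds of MEG signals, our model can identify the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities.
Paper reference translation (metadata) (2022-08-25T10:01:43Z)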
Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
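We capitalize on advances in self-supervised speech representation learning to create state-of-the-art models of the human auditory system. These results suggest that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
Paper reference translation (metadata) (2022-05-27T22:04:02Z)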
Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
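Wav-BERT is a cooperative acoustic and linguistic representation learning method. We integrate a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
Paper reference translation (metadata) (2021-09-19T16:39:22Z)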

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

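This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).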