Fugu-MT Paper Translation (Summary): CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Paper summary: CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

arxiv url: http://arxiv.org/abs/2511.01357v1
Date: Mon, 03 Nov 2025 09:05:16 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:27.188627
Title: CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering
Title (reference translation): CMI-MTL: Multi-task learning based on Cross-Mamba interaction for medical visual question answering
Authors: Qiangguo Jin, Xianyao Zheng, Hui Cui, Changming Sun, Yuqi Fang, Cong Cong, Ran Su, Leyi Wei, Ping Xuan, Junbo Wang
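Abstract summary: Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to handle cross-modal semantic alignment between vision and language. The paper proposes a Cross-Mamba interaction based multi-task learning framework that learns cross-modal feature representations from images and text.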
Reference score (attention level computed independently by this site): 16.115735955158428
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. Treating this task as a simple classification problem may make it unable to adapt to the diversity of free-form answers and overlook the detailed semantic information of free-form answers. In order to tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms the existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct more interpretability experiments to prove the effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.
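Abstract (reference translation): Medical visual question answering (Med-VQA) is an important multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to handle cross-modal semantic alignment between vision and language effectively. In addition, classification-based methods depend on a predefined answer set. Treating the task as a simple classification problem can leave a model unable to adapt to the diversity of free-form answers and can cause it to overlook their detailed semantic information. To address these challenges, the authors introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and text. CMI-MTL consists of three main modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions through cross-modal interleaved feature representation. FFAE uses auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability on open-ended Med-VQA. Experimental results show that CMI-MTL outperforms existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. The authors also run additional interpretability experiments to support the method's effectiveness. The code is available at https://github.com/BioMedIA-repo/CMI-MTL.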
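The abstract names two ideas that a small example can make concrete: interleaving visual and text features into one sequence, and training with a main task plus an auxiliary task. The sketch below is only an illustration of those two general ideas, not the authors' implementation. The function names, the alternating interleaving order, and the `aux_weight` value are all assumptions made here; the Mamba-based interaction itself is not shown, and the actual method is in the linked repository.

```python
import math
from itertools import zip_longest


def interleave(visual_tokens, text_tokens):
    """Alternate visual and text tokens: v0, t0, v1, t1, ...

    Assumed ordering for illustration only. Leftover tokens from the
    longer sequence are appended in their original order.
    """
    missing = object()  # sentinel marking the end of the shorter sequence
    merged = []
    for v, t in zip_longest(visual_tokens, text_tokens, fillvalue=missing):
        if v is not missing:
            merged.append(v)
        if t is not missing:
            merged.append(t)
    return merged


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def multi_task_loss(cls_logits, cls_target, aux_loss, aux_weight=0.5):
    """Main classification loss plus a weighted auxiliary loss.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return cross_entropy(cls_logits, cls_target) + aux_weight * aux_loss


if __name__ == "__main__":
    sequence = interleave(["v0", "v1", "v2"], ["t0"])
    print(sequence)
    print(multi_task_loss([0.0, 0.0], 0, aux_loss=1.0))
```

In a real model the tokens would be feature vectors rather than strings, and the merged sequence would be passed to a sequence model; the weighted sum is one common way to combine a main objective with an auxiliary one.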


Related papers