Fugu-MT Paper Translation (Abstract): M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding

Paper overview: M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding

arxiv url: http://arxiv.org/abs/2507.04289v1
Date: Sun, 06 Jul 2025 08:14:35 GMT
Status: Translation complete
Last updated in system: 2025-07-08 15:46:35.102746
Title: M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding
Title (reference translation): M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding
Authors: Shenxi Liu, Kan Li, Mingyang Zhao, Yuhang Tian, Bin Li, Shoujun Zhou, Hongliang Li, Fuxia Yang
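Abstract summary: M3-Med is the first benchmark for multi-lingual, multi-modal, and multi-hop reasoning in medical video understanding. A key innovation of M3-Med is its multi-hop reasoning task, which requires a model to identify a key entity in the text, find the corresponding visual evidence in the video, and finally synthesize information across both modalities to derive the answer.
Reference score (attention score computed by this site): 13.721987547159715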
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapid progress of artificial intelligence (AI) in multi-modal understanding, there is increasing potential for video comprehension technologies to support professional domains such as medical education. However, existing benchmarks suffer from two primary limitations: (1) Linguistic Singularity: they are largely confined to English, neglecting the need for multilingual resources; and (2) Shallow Reasoning: their questions are often designed for surface-level information retrieval, failing to properly assess deep multi-modal integration. To address these limitations, we present M3-Med, the first benchmark for Multi-lingual, Multi-modal, and Multi-hop reasoning in Medical instructional video understanding. M3-Med consists of medical questions paired with corresponding video segments, annotated by a team of medical experts. A key innovation of M3-Med is its multi-hop reasoning task, which requires a model to first locate a key entity in the text, then find corresponding visual evidence in the video, and finally synthesize information across both modalities to derive the answer. This design moves beyond simple text matching and poses a substantial challenge to a model's deep cross-modal understanding capabilities. We define two tasks: Temporal Answer Grounding in Single Video (TAGSV) and Temporal Answer Grounding in Video Corpus (TAGVC). We evaluated several state-of-the-art models and Large Language Models (LLMs) on M3-Med. The results reveal a significant performance gap between all models and human experts, especially on the complex multi-hop questions where model performance drops sharply. M3-Med effectively highlights the current limitations of AI models in deep cross-modal reasoning within specialized domains and provides a new direction for future research.
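Abstract (reference translation): With the rapid progress of artificial intelligence (AI) in multi-modal understanding, the potential of video comprehension technologies to support professional domains such as medical education is growing. However, existing benchmarks have two main limitations: (1) Linguistic Singularity: they are largely confined to English and neglect the need for multilingual resources; and (2) Shallow Reasoning: their questions are often designed for surface-level information retrieval and cannot properly assess deep multi-modal integration. To address these limitations, we present M3-Med, the first benchmark for multi-lingual, multi-modal, and multi-hop reasoning in medical instructional video understanding. M3-Med consists of medical questions paired with corresponding video segments, annotated by a team of medical experts. A key innovation of M3-Med is its multi-hop reasoning task, which requires a model to first locate a key entity in the text, then find the corresponding visual evidence in the video, and finally synthesize information across both modalities to derive the answer. This design goes beyond simple text matching and poses a substantial challenge to a model's deep cross-modal understanding capabilities. We define two tasks: Temporal Answer Grounding in Single Video (TAGSV) and Temporal Answer Grounding in Video Corpus (TAGVC). We evaluated several state-of-the-art models and Large Language Models (LLMs) on M3-Med. The results reveal a significant performance gap between all models and human experts, especially on the complex multi-hop questions, where model performance drops sharply. M3-Med effectively highlights the current limitations of AI models in deep cross-modal reasoning within specialized domains and provides a new direction for future research.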
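The abstract names two temporal answer grounding tasks (TAGSV and TAGVC) but does not state how predictions are scored. As a rough illustration only, the sketch below assumes a predicted answer is a (start, end) span in seconds and scores it with temporal intersection-over-union (IoU), a measure commonly used for temporal grounding. The function names, the 0.5 threshold, and the sample spans are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec); this sketch assumes start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Overlap of two time spans divided by the length of their union."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], golds: List[Segment],
                  threshold: float = 0.5) -> float:
    """Fraction of questions whose predicted span reaches the IoU threshold.

    The threshold value is an assumption chosen for illustration.
    """
    if not golds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)


if __name__ == "__main__":
    # Hypothetical predictions and annotations for two questions.
    preds = [(10.0, 20.0), (0.0, 5.0)]
    golds = [(12.0, 20.0), (10.0, 15.0)]
    print(recall_at_iou(preds, golds))
```

In a corpus-level setting such as TAGVC, a scorer like this would presumably also need to check that the retrieved video is the correct one before comparing spans; that step is omitted here.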

Related papers

MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models [48.24824129683951]
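This paper introduces medical image reasoning segmentation, a new task that aims to generate segmentation masks from complex and implicit medical instructions. We propose MedSeg-R, an end-to-end framework that uses the reasoning abilities of MLLMs to understand clinical questions. It has two components: (1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multimodal intermediate tokens, and (2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks.
Paper reference translation (metadata) (2025-06-12T08:13:38Z)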
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
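We construct a multimodal dataset rich in medical knowledge. We then introduce Lingshu, an MLLM specialized for medicine. Lingshu undergoes multi-stage training to embed medical expertise and improve its task-solving capabilities.
Paper reference translation (metadata) (2025-06-08T08:47:30Z)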
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos [89.39873803375498]
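VideoMathQA is a benchmark designed to evaluate whether models can perform temporally extended cross-modal reasoning over videos. The benchmark spans 10 mathematical domains, with videos ranging from 10 seconds to over one hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly understand concepts across visual, audio, and textual modalities.
Paper reference translation (metadata) (2025-06-05T17:59:58Z)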
Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge [11.103332181075546]
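The M4IVQA challenge focuses on evaluating models that integrate information from medical instructional videos, understand multiple languages, and answer multi-hop questions that require reasoning over various modalities. Participants in M4IVQA are expected to develop algorithms that process both video and text data, understand multilingual queries, and provide relevant answers to multi-hop medical questions.
Paper reference translation (metadata) (2025-05-11T02:15:14Z)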
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine [9.881981672848598]
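We introduce MedPLIB, a new end-to-end multimodal large language model for the biomedical domain. It supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. The results indicate that MedPLIB achieves state-of-the-art results on multiple medical vision-language tasks.
Paper reference translation (metadata) (2024-12-12T13:41:35Z)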
Inquire, Interact, and Integrate: A Proactive Agent Collaborative Framework for Zero-Shot Multimodal Medical Reasoning [21.562034852024272]
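The adoption of large language models (LLMs) in healthcare has attracted significant research interest. Most state-of-the-art LLMs are unimodal, text-only models that cannot directly process multimodal inputs. To solve medical multimodal reasoning problems, we propose MultiMedRes, a multimodal medical collaborative reasoning framework.
Paper reference translation (metadata) (2024-05-19T18:26:11Z)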
Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
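We show how to automatically collect medical image-text aligned data from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by first collecting a large brain image-text dataset. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
Paper reference translation (metadata) (2024-04-27T05:03:42Z)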
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models [17.643421997037514]
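We propose a new framework that addresses both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. Our model achieves performance comparable to state-of-the-art baselines.
Paper reference translation (metadata) (2024-04-16T02:35:17Z)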
Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining [121.89793208683625]
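Medical artificial general intelligence (MAGI) enables a single foundation model to solve different medical tasks. We propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR).
Paper reference translation (metadata) (2023-04-26T01:26:19Z)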
Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
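We test the capabilities of machine learning models in multimodal understanding of educational content. The dataset contains more than 180 hours of video and more than 9,000 slides, with 10 lecturers from various subjects. We introduce PolyViLT, a multimodal transformer.
Paper reference translation (metadata) (2022-08-17T05:30:18Z)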

The related papers list is generated automatically from the titles and abstracts of papers on this site.

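This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from its use (including all information and translations).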