Fugu-MT paper translation (abstract): MedHorizon: Towards Long-context Medical Video Understanding in the Wild

Paper overview: MedHorizon: Towards Long-context Medical Video Understanding in the Wild

arxiv url: http://arxiv.org/abs/2605.06537v1
Date: Thu, 07 May 2026 16:37:10 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:12.002761
Title: MedHorizon: Towards Long-context Medical Video Understanding in the Wild
Title (reference translation): MedHorizon: Toward long-context medical video understanding in the wild
Authors: Bodong Du, Bowen Liu, Yang Yu, Xinpeng Ding, Zhiheng Wu, Shuning Wang, Shuo Nie, Naiming Liu, Qifeng Chen, Yangqiu Song, Xiaomeng Li
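Abstract summary: Real clinical review often requires full-procedure video understanding. Existing benchmarks often assume that the relevant evidence has already been localized through images, short clips, or pre-segmented videos. MedHorizon is an in-the-wild benchmark for long-context medical video understanding.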
Reference score (independently computed attention score): 78.79695798197447
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.
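Abstract (reference translation): Medical multimodal large language models (MLLMs) have made progress in image understanding and short-video analysis, yet actual clinical review often calls for understanding a full procedure on video. In contrast to general long videos, medical procedures contain highly redundant anatomical views, and the decisive evidence is sparse in time, subtle in space, and dependent on context. Existing benchmarks often assume that this evidence has already been localized through images, short clips, or pre-segmented videos, so the retrieval-before-reasoning problem remains under-tested. MedHorizon is an in-the-wild benchmark for long-context medical video understanding. It preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. The evidence is extremely sparse, with evidence frames making up only 0.166% on average, so models must search noisy procedural streams before they can interpret and aggregate findings. Representative general-domain, medical-domain, and long-video MLLMs are evaluated. The best model reaches only 41.1% accuracy, which shows that current systems are still far from robust full-procedure understanding. Further analysis shows that performance does not scale reliably with more frames; that evidence retrieval and clinical interpretation remain the primary bottlenecks; that these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and that generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs.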

Related papers