Fugu-MT Paper Translation (Abstract): Think Twice to See More: Iterative Visual Reasoning in Medical VLMs

Paper abstract: Think Twice to See More: Iterative Visual Reasoning in Medical VLMs

arxiv url: http://arxiv.org/abs/2510.10052v1
Date: Sat, 11 Oct 2025 06:39:57 GMT
Status: Translation complete
System update date: 2025-10-14 18:06:29.752882
Title: Think Twice to See More: Iterative Visual Reasoning in Medical VLMs
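Title (reference translation): Iterative Visual Reasoning in Medical VLMs
Authors: Kaitao Chen, Shaohao Rui, Yankai Jiang, Jiamin Wu, Qihao Zheng, Chunfeng Song, Xiaosong Wang, Mu Zhou, Mianxin Liu
Abstract summary: We introduce ViTAR, a framework that emulates the iterative reasoning process of human experts. ViTAR treats medical images as interactive objects, enabling models to perform multi-step visual reasoning.
Reference score (independently computed attention measure): 21.083636394814217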
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Medical vision-language models (VLMs) excel at image-text understanding but typically rely on a single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus, and refine the regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a cognitive chain of "think-act-rethink-answer". ViTAR treats medical images as interactive objects, enabling models to engage multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors. In addition, a 16K visual question answering training data has been curated towards fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by the reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the "think" to "rethink" rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both performance and trustworthiness of medical AI.
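Abstract (reference translation): Medical vision-language models (VLMs) excel at image-text understanding but typically rely on single-pass reasoning that ignores localized visual cues. In clinical practice, however, human experts iteratively scan, focus on, and scrutinize regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a new VLM framework that emulates the iterative reasoning process of human experts. ViTAR treats medical images as interactive objects, enabling models to perform multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset of 1K interactive examples that encode expert-like diagnostic behaviors. In addition, 16K visual question answering training samples are curated for fine-grained visual diagnosis. We propose a two-stage training strategy: supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations show that ViTAR outperforms strong state-of-the-art models. Visual attention analysis shows that, from the "think" round to the "rethink" round, ViTAR anchors visual grounding to clinically critical regions and allocates high attention to visual tokens during reasoning, providing mechanistic insight into its improved performance. These results show that embedding expert-style iterative thinking chains into VLMs improves both the performance and the trustworthiness of medical AI.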

Related papers