Fugu-MT Paper Translation (Abstract): CoFFT: Chain of Foresight-Focus Thought for Visual Language Models

Paper abstract: CoFFT: Chain of Foresight-Focus Thought for Visual Language Models

arxiv url: http://arxiv.org/abs/2509.22010v1
Date: Fri, 26 Sep 2025 07:46:30 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.277197
Title: CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Title (reference translation): CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Authors: Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, Mike Zheng Shou
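Abstract summary: Chain of Foresight-Focus Thought (CoFFT) is a training-free approach that enhances visual reasoning by emulating human visual cognition. Its stages function iteratively, creating an interdependent cycle in which reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next show consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.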
Reference score (independently computed attention score): 61.34272727005052
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Despite significant advances in Vision Language Models (VLMs), they remain constrained by the complexity and redundancy of visual input. When images contain large amounts of irrelevant information, VLMs are susceptible to interference, thus generating excessive task-irrelevant reasoning processes or even hallucinations. This limitation stems from their inability to precisely discover and process the required regions during reasoning. To address this limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs' visual reasoning by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation: generates diverse reasoning samples to explore potential reasoning paths, where each sample contains several reasoning steps; (2) Dual Foresight Decoding: rigorously evaluates these samples based on both visual focus and reasoning progression, adding the first step of the optimal sample to the reasoning process; (3) Visual Focus Adjustment: precisely adjusts visual focus toward regions most beneficial for future reasoning, before returning to stage (1) to generate subsequent reasoning samples until reaching the final answer. These stages function iteratively, creating an interdependent cycle where reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.
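Abstract (reference translation): Despite significant advances in vision language models (VLMs), they remain constrained by the complexity and redundancy of visual input. When an image contains a large amount of irrelevant information, VLMs are susceptible to interference and produce excessive task-irrelevant reasoning or even hallucinations. This limitation stems from their inability to precisely discover and process the required regions during reasoning. To address it, we present the Chain of Foresight-Focus Thought (CoFFT), a new training-free approach that strengthens the visual reasoning of VLMs by emulating human visual cognition. Each Foresight-Focus Thought consists of three stages: (1) Diverse Sample Generation, which generates diverse reasoning samples, each containing several reasoning steps, to explore potential reasoning paths; (2) Dual Foresight Decoding, which evaluates these samples on both visual focus and reasoning progression and adds the first step of the optimal sample to the reasoning process; (3) Visual Focus Adjustment, which shifts visual focus toward the regions most beneficial for future reasoning before returning to stage (1) to generate further samples until the final answer is reached. These stages function iteratively, creating an interdependent cycle in which reasoning guides visual focus and visual focus informs subsequent reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL, InternVL-2.5, and Llava-Next show consistent performance improvements of 3.1-5.8% with a controllable increase in computational overhead.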
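
The three-stage cycle described in the abstract can be sketched as a simple control loop. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `cofft_loop`, `Sample`, `generate`, `score`, `adjust_focus`, and `is_final` are hypothetical, the model calls are replaced by toy stand-ins, and the abstract's evaluation on "both visual focus and reasoning progression" is collapsed into a single caller-supplied `score` function.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Sample:
    """One candidate reasoning path containing several reasoning steps."""
    steps: List[str]


def cofft_loop(
    generate: Callable[[List[str], Any], List[Sample]],
    score: Callable[[Sample, List[str], Any], float],
    adjust_focus: Callable[[List[str], Sample, Any], Any],
    is_final: Callable[[str], bool],
    initial_focus: Any,
    max_iters: int = 8,
) -> Tuple[List[str], Any]:
    """Iterate the three stages until a final answer or the iteration cap."""
    reasoning: List[str] = []
    focus = initial_focus
    for _ in range(max_iters):
        # Stage 1 (hypothetical interface): propose diverse candidate paths.
        samples = generate(reasoning, focus)
        if not samples:
            break
        # Stage 2: pick the best-scoring sample and keep only its first step.
        best = max(samples, key=lambda s: score(s, reasoning, focus))
        reasoning.append(best.steps[0])
        if is_final(reasoning[-1]):
            break
        # Stage 3: move the visual focus before the next round of sampling.
        focus = adjust_focus(reasoning, best, focus)
    return reasoning, focus


# Toy stand-ins (assumptions for illustration only; a real system would
# call a VLM here). The focus is modelled as an integer region index.
def toy_generate(reasoning: List[str], focus: int) -> List[Sample]:
    if len(reasoning) >= 2:
        return [Sample(["ANSWER: done"])]
    return [Sample([f"inspect region {focus} (candidate {k})", "continue"])
            for k in range(3)]


def toy_score(sample: Sample, reasoning: List[str], focus: int) -> float:
    return float(len(sample.steps[0]))


def toy_adjust(reasoning: List[str], best: Sample, focus: int) -> int:
    return focus + 1


def toy_is_final(step: str) -> bool:
    return step.startswith("ANSWER")


if __name__ == "__main__":
    trace, final_focus = cofft_loop(
        toy_generate, toy_score, toy_adjust, toy_is_final, initial_focus=0
    )
    print(trace, final_focus)
```

Passing the stages in as callables keeps the loop independent of any particular model. The `max_iters` cap is an added assumption that bounds the extra computation, loosely mirroring the "controllable" overhead mentioned in the abstract.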


Related papers