Fugu-MT Paper Translation (Summary): A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

Paper summary: A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

arxiv url: http://arxiv.org/abs/2604.10506v1
Date: Sun, 12 Apr 2026 07:48:44 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.061598
Title: A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
Authors: Xiaoda Yang, Shuai Yang, Can Wang, Jingyang Xue, Menglan Tang, Checheng Yu, Xunzhe Zhou, Sashuai Zhou, Tao Jin, Lixin Yang, Xiangyu Yue, Zhou Zhao
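Abstract summary: In "multi-image reasoning hallucination", a large performance drop between forward and reverse temporal queries indicates reliance on superficial shortcuts rather than genuine understanding. To mitigate this, the authors develop a new Chain-of-Thought dataset that decomposes reasoning into detailed steps and definitive judgments. Experiments show that the approach not only improves accuracy but also narrows the forward-backward performance gap from over 70% to 6.53%.
Reference score (attention score computed independently by this site): 49.61652671596548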
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is "multi-image reasoning hallucination", where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the forward-backward performance gap from over 70% to only 6.53%. This confirms the method's ability to develop authentic dynamic reasoning and reduce the inherent temporal biases of current VLMs.
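The abstract reports a forward-backward performance gap but does not define how it is computed. The sketch below shows one plausible reading, assuming the gap is the absolute difference, in percentage points, between accuracy on forward temporal queries and accuracy on reverse temporal queries. The function names and the toy yes/no data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return the percentage of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def forward_backward_gap(
    fwd_preds: Sequence[str],
    fwd_labels: Sequence[str],
    rev_preds: Sequence[str],
    rev_labels: Sequence[str],
) -> float:
    """Assumed definition: |forward accuracy - reverse accuracy| in percentage points."""
    return abs(accuracy(fwd_preds, fwd_labels) - accuracy(rev_preds, rev_labels))


# Toy data (hypothetical): a model that tends to answer "yes" regardless of
# frame order looks perfect on forward queries and fails on reversed ones.
fwd_labels = ["yes", "yes", "no", "yes"]
fwd_preds = ["yes", "yes", "no", "yes"]
rev_labels = ["no", "no", "yes", "no"]
rev_preds = ["yes", "yes", "yes", "no"]

gap = forward_backward_gap(fwd_preds, fwd_labels, rev_preds, rev_labels)
print(f"forward-backward gap: {gap:.2f} percentage points")
```

Under this assumed definition, a model that truly reasons about temporal order should score similarly in both directions, so a small gap is the desired outcome.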


Related papers