Fugu-MT Paper Translation (Abstract): What is Holding Back Latent Visual Reasoning?

Paper overview: What is Holding Back Latent Visual Reasoning?

arxiv url: http://arxiv.org/abs/2605.18445v1
Date: Mon, 18 May 2026 14:14:49 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:49.708755
Title: What is Holding Back Latent Visual Reasoning?
Title (reference translation): What is holding back latent visual reasoning?
Authors: André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann
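Abstract summary: We find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. Our experiments reveal two key issues holding back latent visual reasoning.
Reference score (independently computed attention score): 23.63938540988447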
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative "dummy" tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypass them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.
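Abstract (reference translation): Humans can approach complex visual problems by mentally simulating intermediate visual steps rather than reasoning through language alone. Inspired by this, several studies on Vision-Language Models have recently explored chain-of-thought reasoning that uses continuous latent tokens as intermediate steps of visual imagination. This work examines how recent models make use of such latent tokens. Surprisingly, model accuracy is unaffected when the latent tokens are replaced with uninformative "dummy" tokens, which indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning. First, in most existing datasets, oracle latent tokens provide little information beyond the original image and do not substantially simplify the task, so models ignore them during training and effectively bypass them at inference time. When models are fine-tuned on a diagnostic dataset in which latent tokens provide sufficient support for the final prediction, they are shown to be able to rely on them causally. Second, the latent tokens generated at inference time deviate from their corresponding oracle representations and collapse into a narrow region, which prevents any benefit even when the model relies on them. Overall, the findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.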
