Fugu-MT Paper Translation (Abstract): Leveraging Latent Visual Reasoning in Silence

Paper overview: Leveraging Latent Visual Reasoning in Silence

arxiv url: http://arxiv.org/abs/2605.18641v1
Date: Mon, 18 May 2026 16:46:02 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:50.106006
Title: Leveraging Latent Visual Reasoning in Silence
Authors: Dongyao Zhu, Zhen Wang, Xi Xiao, Han Jiang, Saeed Vahidian, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su, Raju Vatsavai, Jianyang Gu
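Abstract summary: Replacing latent tokens with random noise, or removing them entirely, is shown to cause little performance degradation across spatial reasoning benchmarks. The paper proposes an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL.
Reference score (independently computed attention score): 46.71750408786006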
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available at GitHub (https://github.com/ddydyd32/silent-lvr/tree/master) and Hugging Face (https://huggingface.co/collections/cornuHGF/silent-lvr).
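
The abstract does not give the exact form of the attention-based reward, so the following is only a minimal sketch of one plausible reading: score a rollout by how much attention the later text tokens place on the generated latent tokens, and return zero when no latent tokens were generated so that pure-text reasoning is not penalized. The function name `latent_attention_reward`, its arguments, and the row-normalized attention matrix are all assumptions made for illustration; they are not taken from the paper or its code.

```python
def latent_attention_reward(attn, latent_idx, text_idx):
    """Mean attention mass that later text tokens place on latent tokens.

    Hypothetical sketch, not the paper's definition.

    attn       : square list of lists; attn[q][k] is the attention weight from
                 query position q to key position k, each row summing to 1
                 (assumed already averaged over heads and layers).
    latent_idx : positions of generated latent tokens (may be empty).
    text_idx   : positions of text tokens generated after the latent tokens.
    """
    # No latent tokens (or no later text): neutral reward, so a rollout that
    # reasons purely in text is neither rewarded nor punished by this term.
    if not latent_idx or not text_idx:
        return 0.0

    # For each later text token, sum its attention over all latent positions.
    per_token_mass = [sum(attn[q][k] for k in latent_idx) for q in text_idx]

    # Average over text tokens so the reward does not grow with answer length.
    return sum(per_token_mass) / len(per_token_mass)
```

In a toy causal sequence with one latent token at position 1 and text tokens at positions 2 and 3, where those text tokens place 0.6 and 0.3 of their attention on the latent token, the function returns the mean of the two masses. Averaging rather than summing is a design choice of this sketch, not something the abstract specifies.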

Related papers