Fugu-MT paper translation (abstract): Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

Paper abstract: Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

arxiv url: http://arxiv.org/abs/2605.12163v2
Date: Wed, 13 May 2026 02:23:44 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.902458
Title: Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
Title (reference translation): Self-Consistent Latent Reasoning: Long Latent-Sequence Reasoning for Vision-Language Models
Authors: Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha
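Abstract summary: SCOLAR (Self-COnsistent LAtent Reasoning) introduces a lightweight detransformer that generates auxiliary visual tokens in a single shot. SCOLAR extends the acceptable latent CoT length by over $30\times$ and achieves state-of-the-art results among open-source models on real-world reasoning benchmarks.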
Reference score (independently computed attention metric): 56.21523258053447
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.
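Abstract (reference translation): In language reasoning, longer chains of thought consistently yield better performance. However, the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. The root cause is Information Gain Collapse: autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further find that the heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, SCOLAR (Self-COnsistent LAtent Reasoning) introduces a lightweight detransformer that uses the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends the acceptable latent CoT length by over $30\times$, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and demonstrates strong out-of-distribution generalization.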
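
To give a sense of scale for the $\geq 128\times$ pooling factor mentioned in the abstract, the following is a minimal sketch, not the paper's method. It assumes plain average pooling over consecutive tokens and uses random vectors as stand-ins for patch embeddings; the token count, embedding width, and the `pool_tokens` helper are all hypothetical, since the abstract does not specify the pooling operator or sizes.

```python
import random
import statistics


def pool_tokens(tokens, factor):
    """Average-pool consecutive groups of `factor` token vectors into one vector."""
    if len(tokens) % factor != 0:
        raise ValueError("token count must be divisible by the pooling factor")
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        # Mean over the group, computed independently for each dimension.
        pooled.append([sum(vec[d] for vec in group) / factor for d in range(dim)])
    return pooled


random.seed(0)
# Hypothetical sizes chosen for illustration only (not taken from the paper).
NUM_TOKENS, DIM, FACTOR = 1024, 16, 128
patches = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
pooled = pool_tokens(patches, FACTOR)

# Compare the spread of values before and after pooling.
var_before = statistics.pvariance([x for vec in patches for x in vec])
var_after = statistics.pvariance([x for vec in pooled for x in vec])

print(len(patches), "->", len(pooled), "tokens")
print(f"variance before: {var_before:.3f}, after: {var_after:.4f}")
```

With these assumed sizes, 1024 tokens shrink to 8, and averaging washes out most of the per-token variation. Independent random vectors exaggerate the effect compared with real, spatially correlated image features, so this only illustrates why such targets might carry little signal; it does not reproduce the paper's finding.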


Related papers