Fugu-MT paper translation (abstract): EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Paper overview: EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

arxiv url: http://arxiv.org/abs/2603.12252v1
Date: Thu, 12 Mar 2026 17:58:48 GMT
Status: translation complete
Last updated in system: 2026-03-13 14:46:26.29057
Title: EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Title (reference translation): EndoCoT: Scaling Endogenous Chains in Diffusion Models
Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
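Abstract summary: Single-step encoding fails to activate the Chain-of-Thought process. Invariant guidance during decoding prevents the DiT from progressively decomposing complex instructions into actionable denoising steps. We propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates the reasoning potential of MLLMs.
Reference score (independently computed attention score): 40.37673945173621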
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process, which prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework built from two components. First, an iterative thought guidance module activates the MLLM's reasoning potential by iteratively refining latent thought states, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module ensures that the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) show an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
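Abstract (reference translation): In recent years, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks as text encoders for handling complex tasks such as spatial reasoning. However, this paradigm has two limitations. (i) The MLLM text encoder has insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance on complex tasks. (ii) The guidance stays invariant throughout decoding. Because of this invariant guidance, the DiT cannot progressively decompose complex instructions into actionable denoising steps, even when the MLLM encodings are correct. To address this, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates the reasoning potential of MLLMs by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied so that the reasoning trajectory stays grounded in textual supervision, by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder provides carefully reasoned guidance, and the DiT executes it progressively, ultimately solving complex tasks step by step. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) reach an average accuracy of 92.1%, exceeding the strongest baseline by 8.3 percentage points.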
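
The two ideas in the abstract, guidance that evolves across denoising steps and a final thought state aligned with the ground-truth answer, can be illustrated with a toy sketch. This is not the authors' implementation: `refine_thought`, `thought_trajectory`, `denoise`, and `grounding_loss` are hypothetical names, the "MLLM" is replaced by a simple contraction toward a fixed vector, and the "DiT" is a loop that pulls a latent toward whatever guidance it receives at each step.

```python
from typing import List, Sequence

Vector = List[float]


def refine_thought(state: Vector, attractor: Vector, rate: float = 0.5) -> Vector:
    """Hypothetical stand-in for one MLLM refinement pass.

    Assumption: refinement moves the latent thought state a fixed fraction
    of the way toward an attractor. A real model would compute this update
    from its own weights; the attractor is only a toy device.
    """
    return [s + rate * (a - s) for s, a in zip(state, attractor)]


def thought_trajectory(initial: Vector, attractor: Vector, steps: int) -> List[Vector]:
    """Iteratively refine the thought state, keeping every intermediate state."""
    states: List[Vector] = []
    state = list(initial)
    for _ in range(steps):
        state = refine_thought(state, attractor)
        states.append(state)
    return states


def denoise(latent: Vector, guidance_per_step: Sequence[Vector],
            strength: float = 0.5) -> Vector:
    """Toy denoiser: each step pulls the latent toward that step's guidance."""
    current = list(latent)
    for guidance in guidance_per_step:
        current = [x + strength * (g - x) for x, g in zip(current, guidance)]
    return current


def grounding_loss(state: Vector, answer: Vector) -> float:
    """Mean squared error between a state and the ground-truth answer vector."""
    return sum((s - a) ** 2 for s, a in zip(state, answer)) / len(state)


if __name__ == "__main__":
    answer = [1.0, -1.0]   # assumed embedding of the ground-truth answer
    noise = [4.0, 4.0]     # assumed initial noisy latent
    states = thought_trajectory([0.0, 0.0], answer, steps=4)

    # Invariant guidance: the first encoding is reused at every denoising step.
    invariant = denoise(noise, [states[0]] * 4)
    # Evolving guidance: denoising step k is conditioned on thought state k.
    evolving = denoise(noise, states)

    print("terminal grounding loss:", grounding_loss(states[-1], answer))
    print("invariant guidance error:", grounding_loss(invariant, answer))
    print("evolving guidance error:", grounding_loss(evolving, answer))
```

In this toy setting the latent conditioned on the evolving thought states ends closer to the answer vector than the latent conditioned on a single fixed encoding. That shows only the intuition behind step-wise guidance; it says nothing about the paper's reported accuracy.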

