Fugu-MT Paper Translation (Abstract): ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model

Paper overview: ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model

arxiv url: http://arxiv.org/abs/2601.22730v1
Date: Fri, 30 Jan 2026 09:06:45 GMT
Status: translation complete
Last updated in system: 2026-02-02 18:28:15.337917
Title: ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model
Authors: Xiaoshu Chen, Sihang Zhou, Ke Liang, Taichun Zhou, Xinwang Liu,
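Abstract summary: Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). The authors propose ImgCoT, which changes the autoencoder's reconstruction target from textual CoT to visual CoT obtained by rendering the CoT into images. This substitutes linguistic bias with spatial inductive bias, enabling latent tokens to better capture global reasoning structure.
Reference score (attention score computed independently by this site): 34.90582960625524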
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT that replaces the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose a loose ImgCoT, a hybrid reasoning that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.
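The abstract states only that loose ImgCoT keeps "a few key textual reasoning steps, selected based on low token log-likelihood". The following is a minimal sketch of one way such a selection could work, not the authors' implementation. It assumes that per-token log-likelihoods are already available for each step, that a step is scored by the mean of its token log-likelihoods, and that the k lowest-scoring steps are kept in their original order. The names `mean_logprob` and `select_key_steps`, the mean aggregation, and the parameter `k` are all hypothetical.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average log-likelihood of the tokens in one reasoning step."""
    if not token_logprobs:
        raise ValueError("a step must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def select_key_steps(
    steps: Sequence[str],
    step_token_logprobs: Sequence[Sequence[float]],
    k: int,
) -> List[str]:
    """Keep the k steps with the lowest mean token log-likelihood.

    Assumption (not from the paper): low mean log-likelihood marks a step
    as hard to predict, so it is kept as text. The selected steps are
    returned in their original order.
    """
    if len(steps) != len(step_token_logprobs):
        raise ValueError("need one list of token log-probs per step")
    if k < 0:
        raise ValueError("k must be non-negative")

    scores = [mean_logprob(lp) for lp in step_token_logprobs]
    # Rank step indices from least to most likely, then take the first k.
    ranked = sorted(range(len(steps)), key=lambda i: scores[i])
    keep = sorted(ranked[:k])  # restore original step order
    return [steps[i] for i in keep]


# Toy example with made-up log-probabilities.
steps = [
    "Let x be the number of apples.",
    "Then 3x + 2 = 11.",
    "So x = 3.",
]
logprobs = [
    [-0.1, -0.2, -0.1],
    [-2.0, -1.5, -2.5],
    [-0.3, -0.2],
]
key_steps = select_key_steps(steps, logprobs, k=1)
```

In this toy input the second step has the lowest mean log-likelihood, so it is the one retained as text alongside the visual latent tokens. Other aggregations (for example the minimum token log-likelihood per step) would fit the abstract's description equally well.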
