Fugu-MT Paper Translation (Abstract): Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Paper summary: Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

arxiv url: http://arxiv.org/abs/2603.19232v1
Date: Thu, 19 Mar 2026 17:59:55 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:06.338065
Title: Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Title (reference translation): Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu
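Abstract summary: Presents Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation. On ImageNet-256, it achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters.
Reference score (independently computed attention level): 88.42820935044021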
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
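Abstract (reference translation): Visual generation with discrete tokens has attracted significant attention because it enables a unified token prediction paradigm shared with language models and promises seamless multimodal architectures. However, current discrete generation methods are limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. High-dimensional pretrained representations (768-1024 dims) could bridge this gap, but generating them discretely poses fundamental challenges. This paper proposes CubiD, the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This lets the model learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Importantly, we verify that these discretized tokens preserve the original representation capabilities, showing that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at https://github.com/YuqingWang1029/CubiD.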
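
The abstract's two structural claims (any dimension at any position can be masked, and the number of generation steps $T$ does not grow with the feature dimension $d$) can be illustrated with a minimal sketch. This is not the authors' implementation: it contains no model, the function names `random_cubic_mask` and `unmask_schedule` are hypothetical, and the cosine unmasking schedule is an assumption borrowed from common masked-generation practice rather than something the abstract states.

```python
import math
import random


def random_cubic_mask(h, w, d, mask_ratio, rng):
    """Pick hidden entries uniformly over the whole h x w x d cube.

    Hypothetical helper: each masked entry is one (row, col, dim) cell,
    so masking is per dimension, not per spatial token.
    """
    total = h * w * d
    num_masked = round(mask_ratio * total)
    flat = rng.sample(range(total), num_masked)
    # Decode each flat index into (row, col, dim).
    return {(i // (w * d), (i // d) % w, i % d) for i in flat}


def unmask_schedule(total, steps):
    """Number of entries revealed at each of `steps` generation steps.

    Assumption: a cosine schedule for the count of still-masked entries.
    """
    remaining = [
        round(total * math.cos(math.pi / 2 * t / steps))
        for t in range(steps + 1)
    ]
    remaining[0], remaining[-1] = total, 0  # start fully masked, end fully revealed
    return [remaining[t] - remaining[t + 1] for t in range(steps)]


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
h, w, d, T = 4, 4, 8, 6
mask = random_cubic_mask(h, w, d, 0.5, rng)
schedule = unmask_schedule(h * w * d, T)

# A much larger cube still uses exactly T steps; only the per-step counts grow.
big_schedule = unmask_schedule(16 * 16 * 768, T)
```

In this sketch the schedule always has `T` entries, so widening `d` from 8 to 768 changes how many cells are revealed per step but not how many steps are taken, which is the sense in which $T \ll hwd$ matters.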
