Fugu-MT Paper Translation (Summary): Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

Paper summary: Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation

arxiv url: http://arxiv.org/abs/2603.16373v1
Date: Tue, 17 Mar 2026 11:01:08 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.234468
Title: Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation
Authors: Yunpeng Qu, Kaidong Zhang, Yukang Ding, Ying Chen, Jian Wang,
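Abstract summary: SemTok is a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state of the art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. A masked autoregressive generation framework built on SemTok yields notable improvements in downstream image generation tasks.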
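The abstract does not describe how the 2D-to-1D tokenization scheme works internally. As a rough illustration only, the sketch below assumes a design in the spirit of learnable-query 1D tokenizers: K hypothetical query vectors cross-attend over the flattened H x W patch grid, and each pooled latent is snapped to its nearest entry in a hypothetical codebook, giving K discrete token ids regardless of the grid size. All names and sizes (flatten_grid, cross_attend, quantize, tokenize, H, W, D, K, V) are illustrative assumptions, not SemTok's actual architecture or API.

```python
import math
import random


def flatten_grid(grid):
    """Flatten an H x W grid of D-dim patch features into H*W vectors (row-major)."""
    return [vec for row in grid for vec in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patches):
    """Each 1D query pools all 2D patches with scaled dot-product attention weights."""
    d = len(patches[0])
    pooled = []
    for q in queries:
        weights = softmax([dot(q, p) / math.sqrt(d) for p in patches])
        pooled.append([sum(w * p[i] for w, p in zip(weights, patches)) for i in range(d)])
    return pooled


def quantize(latents, codebook):
    """Replace each continuous latent with the index of its nearest codebook vector (squared L2)."""
    ids = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        ids.append(min(range(len(codebook)), key=dists.__getitem__))
    return ids


def tokenize(grid, queries, codebook):
    """Toy 2D-to-1D tokenizer (hypothetical): H x W patch grid -> len(queries) discrete ids."""
    return quantize(cross_attend(queries, flatten_grid(grid)), codebook)


if __name__ == "__main__":
    rng = random.Random(0)
    H, W, D, K, V = 4, 4, 8, 6, 16  # hypothetical sizes: 16 patches -> 6 tokens, vocabulary of 16
    grid = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
    queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
    codebook = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(V)]
    print(tokenize(grid, queries, codebook))  # a list of 6 ids, each in [0, 16)
```

In this toy setup 16 patches become 6 token ids: the sequence length is fixed by the number of queries K rather than by the spatial grid, which is what lets a 1D token sequence stay compact.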
Reference score (site-computed attention score): 11.568334063059638
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose SemTok, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state of the art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.
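The abstract names a semantic alignment constraint but does not define it. One common way to express such a constraint, assumed here purely for illustration, is to project the tokenizer's latents into the feature space of a frozen pretrained vision encoder and penalize low cosine similarity, adding that term to the reconstruction loss with a weight. The projection matrix, the teacher features, the function names, and the weight lam=0.5 are all hypothetical; SemTok's actual constraint, target encoder, and weighting may differ.

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb + eps)


def project(z, weight):
    """Hypothetical linear head mapping a tokenizer latent into the teacher's feature space."""
    return [sum(w * x for w, x in zip(row, z)) for row in weight]


def semantic_alignment_loss(latents, teacher_feats, weight):
    """1 - mean cosine similarity between projected latents and frozen teacher features."""
    sims = [cosine(project(z, weight), t) for z, t in zip(latents, teacher_feats)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(recon_loss, align_loss, lam=0.5):
    """Reconstruction term plus a (hypothetically weighted) alignment term."""
    return recon_loss + lam * align_loss


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    latents = [[1.0, 2.0], [0.5, -1.0]]
    teacher = [[2.0, 4.0], [-0.5, 1.0]]  # first pair aligned, second pair opposite
    align = semantic_alignment_loss(latents, teacher, identity)
    print(round(align, 6))  # 1 - (1 + (-1)) / 2, i.e. 1.0
    print(total_loss(0.3, align))
```

The loss is 0 when every projected latent points the same way as its teacher feature and 2 when every pair points in opposite directions, so minimizing it pulls the token latents toward the teacher's semantic directions without constraining their magnitude.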
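The abstract does not say how the masked autoregressive framework samples tokens. The sketch below assumes a MaskGIT-style iterative decoder over the 1D token sequence: start fully masked, predict every masked position at each step, commit the most confident predictions, and keep the rest masked according to a cosine schedule until none remain. The predictor is a random stand-in (make_dummy_predictor is hypothetical); a real model would condition on the committed tokens, and practical samplers usually draw tokens stochastically rather than taking a fixed guess.

```python
import math
import random

MASK = -1  # sentinel id for a masked position


def num_masked(step, total_steps, seq_len):
    """Cosine schedule: positions still masked after `step` of `total_steps` steps."""
    return math.floor(seq_len * math.cos(math.pi / 2 * step / total_steps))


def decode(predict, seq_len, total_steps):
    """Iteratively fill a fully masked 1D token sequence.

    predict(tokens) must return one (token_id, confidence) pair per position.
    Committed tokens are never changed; at each step the most confident
    predictions at masked positions are committed until only
    num_masked(step, ...) positions remain masked.
    """
    tokens = [MASK] * seq_len
    for step in range(1, total_steps + 1):
        preds = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)  # most confident first
        n_commit = len(masked) - num_masked(step, total_steps, seq_len)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def make_dummy_predictor(vocab_size, seed=0):
    """Hypothetical stand-in for a trained model: random ids with random confidences."""
    rng = random.Random(seed)

    def predict(tokens):
        return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

    return predict


if __name__ == "__main__":
    print(decode(make_dummy_predictor(vocab_size=32), seq_len=16, total_steps=8))
```

Because the cosine schedule reaches zero at the final step, every position is filled after total_steps iterations; early steps commit only a few tokens, while later steps commit progressively more.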

Related papers list