Fugu-MT paper translation (abstract): WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

Paper overview: WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

arxiv url: http://arxiv.org/abs/2605.18115v1
Date: Mon, 18 May 2026 09:24:39 GMT
Status: translation complete
Last updated in system: 2026-05-19 17:57:49.225993
Title: WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
Title (reference translation): WinTok: A Win-Win Hybrid Tokenizer That Decomposes Visual Understanding and Generation with Transferable Tokens
Authors: Yiwei Guo, Shaobin Zhuang, Zhipeng Huang, Canmiao Fu, Chen Li, Jing Lyu, Yali Wang
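Abstract summary: WinTok is a hybrid tokenizer for visual understanding and generation. It supplements pixel tokens with a set of learnable semantic tokens. WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41.
Reference score (independently computed attention score): 27.89104188378633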
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.
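Abstract (reference translation): Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. However, existing approaches struggle with the inherent conflict between these tasks, because a single token space must support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further strengthen understanding, an asymmetric token distillation mechanism is introduced: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, so they inherit strong discriminative power while remaining flexible. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.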
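
The abstract describes two ideas: pixel tokens are supplemented with a set of learnable semantic tokens, and those semantic tokens are guided by embeddings from a pretrained visual foundation model. The sketch below is only an illustration of that general pattern, not the authors' implementation. Every name in it (`build_hybrid_sequence`, `distillation_loss`, the token counts, the cosine-based loss, and the random vectors standing in for real tokens and teacher embeddings) is an assumption made for this example. Applying the loss only to the semantic slots is one possible reading of "asymmetric"; the abstract does not specify the details.

```python
import math
import random


def build_hybrid_sequence(pixel_tokens, semantic_tokens):
    """Return one sequence: pixel tokens followed by semantic tokens."""
    return list(pixel_tokens) + list(semantic_tokens)


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def distillation_loss(student_tokens, teacher_embeddings):
    """Mean (1 - cosine similarity) over paired student/teacher vectors.

    Hypothetical loss choice; the paper's actual objective may differ.
    """
    pairs = list(zip(student_tokens, teacher_embeddings))
    return sum(1.0 - cosine_similarity(s, t) for s, t in pairs) / len(pairs)


def random_vectors(rng, count, dim):
    """Random stand-ins for token vectors (no real model involved)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Hypothetical sizes chosen only for illustration.
NUM_PIXEL, NUM_SEMANTIC, DIM = 16, 4, 8

rng = random.Random(0)
pixel_tokens = random_vectors(rng, NUM_PIXEL, DIM)
semantic_tokens = random_vectors(rng, NUM_SEMANTIC, DIM)    # would be learnable
teacher_embeddings = random_vectors(rng, NUM_SEMANTIC, DIM)  # would come from a frozen model

sequence = build_hybrid_sequence(pixel_tokens, semantic_tokens)

# Only the semantic slots are compared with the teacher; pixel slots are left
# free for reconstruction (an assumed reading of the abstract).
semantic_slots = sequence[NUM_PIXEL:]
loss = distillation_loss(semantic_slots, teacher_embeddings)
print(len(sequence), round(loss, 4))
```

In a real tokenizer an encoder would process the combined sequence before the loss is computed, and the semantic tokens would be updated by gradient descent; both steps are omitted here to keep the sketch dependency-free.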

Related papers