Fugu-MT paper translation (abstract): A More Word-like Image Tokenization for MLLMs

Paper overview: A More Word-like Image Tokenization for MLLMs

arxiv url: http://arxiv.org/abs/2605.17954v1
Date: Mon, 18 May 2026 07:09:46 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:48.940169
Title: A More Word-like Image Tokenization for MLLMs
Title (reference translation): Word-like Image Tokenization for MLLMs
Authors: Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee
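Abstract summary: This paper proposes Disentangled Visual Tokenization (DiVT), which clusters patch embeddings into coherent semantic units. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with far fewer visual tokens.
Reference score (independently computed attention score): 26.120899392740203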
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.
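Abstract (reference translation): Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps pixels into a sequence of tokens in its embedding space. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, whereas prevailing visual projectors transform an image into a long stream of continuous, highly correlated embeddings. As a result, visual tokens behave differently from the word-like units that LLMs were originally trained to understand. This paper proposes Disentangled Visual Tokenization (DiVT), which clusters patch embeddings into coherent semantic units. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off that modifies neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with far fewer visual tokens, demonstrating robustness under limited token budgets and significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. The code is available at https://github.com/snuviplab/DiVT.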

Related papers