Fugu-MT Paper Translation (Abstract): UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Paper summary: UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

arxiv url: http://arxiv.org/abs/2603.11320v1
Date: Wed, 11 Mar 2026 21:27:15 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.654336
Title: UniCompress: Token Compression for Unified Vision-Language Understanding and Generation
Authors: Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu
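Abstract summary: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text. This paper proposes a unified token compression algorithm that significantly reduces the number of visual tokens while preserving performance on both image understanding and generation tasks.
Reference score (independently computed attention metric): 62.943173382496276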
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource-constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm, UniCompress, that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided by learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real-world multimodal applications.
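The abstract does not spell out how compression and decompression work, so the following is only a minimal sketch of what a 4x reduction in image tokens can look like. It assumes plain 2 x 2 average pooling as a stand-in for the compressor and a simple copy-back as a stand-in for the decompressor; this is not the authors' mechanism, and the learnable global meta tokens are omitted entirely. The names `compress`, `decompress`, `grid`, and `factor` are hypothetical.

```python
from typing import List

Token = List[float]


def compress(tokens: List[Token], grid: int, factor: int = 2) -> List[Token]:
    """Average-pool a grid x grid token map over factor x factor windows.

    Assumption: average pooling stands in for the paper's learned compressor.
    """
    assert len(tokens) == grid * grid and grid % factor == 0
    dim = len(tokens[0])
    out: List[Token] = []
    for r in range(0, grid, factor):
        for c in range(0, grid, factor):
            # Collect the tokens that fall inside this window (row-major layout).
            window = [tokens[(r + dr) * grid + (c + dc)]
                      for dr in range(factor) for dc in range(factor)]
            out.append([sum(t[d] for t in window) / len(window)
                        for d in range(dim)])
    return out


def decompress(pooled: List[Token], grid: int, factor: int = 2) -> List[Token]:
    """Copy each pooled token back to every position of its window.

    Assumption: copy-back stands in for the paper's learned decompressor.
    """
    small = grid // factor
    return [list(pooled[(r // factor) * small + (c // factor)])
            for r in range(grid) for c in range(grid)]


# Hypothetical 4 x 4 map of 1-dimensional tokens with values 0..15.
tokens = [[float(i)] for i in range(16)]
pooled = compress(tokens, grid=4)
restored = decompress(pooled, grid=4)
print(len(tokens), len(pooled), len(restored))  # 16 4 16
```

With `factor=2` the sequence shrinks from 16 to 4 image tokens, matching the "up to 4 times" figure in the abstract. Because full self-attention compares every token pair, a 4x shorter image sequence means up to 16x fewer pairwise attention scores among the image tokens, which is one reason a reduction of this size can affect latency and memory.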

Related papers