Fugu-MT Paper Translation (Abstract): HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Paper overview: HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

arxiv url: http://arxiv.org/abs/2606.13289v1
Date: Thu, 11 Jun 2026 12:46:07 GMT
Status: Translation complete
System update date: 2026-06-12 15:55:27.79513
Title: HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
Title (reference translation): HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang
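Abstract summary: We present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). The design is driven by two challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. Instantiated as a 7B model, HYDRA-X achieves strong performance across image and video understanding and generation tasks.
Reference score (attention score computed independently by this site): 48.01715215603613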
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
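Abstract (reference translation): Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) because they map diverse visual inputs into a unified representation space. This paper presents HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). The design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image-level and video-level semantic awareness into the latent space. For the first challenge, ablations show that (1) frame-level causal temporal attention is sufficient for visual reconstruction, whereas full spatiotemporal attention degrades it, and (2) hierarchical temporal compression substantially outperforms single-step alternatives. For the second challenge, the authors propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, enforcing complementary semantic structures within the compact latent space. For editing, source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, which substantially improves editing consistency and accelerates convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.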
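
Illustrative note: the abstract contrasts frame-level causal temporal attention with full spatiotemporal attention but gives no implementation details. The sketch below shows one common reading of the term, in which every token of a frame may attend to all tokens of the same frame and of earlier frames, and never to later frames. The function names `frame_causal_mask` and `full_mask`, the flattened frame-major token layout, and the toy sizes are assumptions made for illustration, not taken from the paper.

```python
def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Build a frame-level (block) causal mask.

    Tokens are assumed to be flattened frame by frame, so token index i
    belongs to frame i // tokens_per_frame. mask[q][k] is True when query
    token q is allowed to attend to key token k.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # Allow keys from the same frame or any earlier frame.
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(total)])
    return mask


def full_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Full spatiotemporal attention: every token may attend to every token."""
    total = num_frames * tokens_per_frame
    return [[True] * total for _ in range(total)]


if __name__ == "__main__":
    # Toy example: 3 frames with 2 tokens each (hypothetical sizes).
    for row in frame_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "." for allowed in row))
```

With these toy sizes the printed mask is block lower-triangular: rows for the first frame see only that frame, while rows for the last frame see every token. A boolean mask of this shape, where True means "may attend", is the kind of input an attention implementation would typically consume; how HYDRA-X actually wires it into the ViT is not stated in the abstract.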

Related papers