Fugu-MT Paper Translation (Abstract): Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation

Paper summary: Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation

arxiv url: http://arxiv.org/abs/2605.25012v1
Date: Sun, 24 May 2026 11:32:30 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.654892
Title: Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation
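Title (reference translation): Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation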
Authors: Imanol G. Estepa, Jesús M Rodríguez-de-Vera, Bhalaji Nagarajan, Petia Radeva
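Abstract summary: Discriminative and generative vision models excel in their respective domains but are semantically misaligned. This paper introduces LEASE, a self-supervised framework that bridges this gap. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods.
Reference score (independently computed attention metric): 13.939029266977235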
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative-discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS w.r.t MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines. Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content.
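Abstract (reference translation): Discriminative and generative vision models excel in their respective domains but are semantically misaligned, which hinders progress toward unified visual learning. LEASE (LEArning from SEmantic Dictionaries) is a self-supervised framework that bridges this gap. LEASE operates entirely in a discrete token space generated by a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss captures fine-grained generative detail, and a codebook contrast loss aligns encoder features with discriminative semantics through adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE outperforms VQGAN-based methods such as MAGE and Sorcen in linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS w.r.t. MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building on its learned representations, where it is competitive with specialized baselines. LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content.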
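
Illustrative sketch (not from the paper): the abstract names two objectives, a masked token reconstruction loss and a codebook contrast loss with adaptive centroid weighting, but gives no formulas. The Python below is only one plausible reading under stated assumptions. The function names, the cosine-similarity softmax, the `temperature` and `contrast_weight` parameters, and the treatment of the centroid weights as a soft target distribution are all hypothetical and are not taken from the LEASE implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_token_loss(token_logits, target_ids, masked):
    """Mean cross-entropy over masked positions only (assumed form)."""
    losses = [
        -log_softmax(logits)[target]
        for logits, target, is_masked in zip(token_logits, target_ids, masked)
        if is_masked
    ]
    return sum(losses) / len(losses) if losses else 0.0


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def codebook_contrast_loss(feature, centroids, centroid_weights, temperature=0.1):
    """Cross-entropy between a softmax over feature-centroid similarities
    and a soft target over centroids. Assumption: the 'adaptive' weights
    are supplied by the caller and sum to 1."""
    sims = [cosine(feature, c) / temperature for c in centroids]
    log_probs = log_softmax(sims)
    return -sum(w * lp for w, lp in zip(centroid_weights, log_probs))


def combined_loss(reconstruction, contrast, contrast_weight=1.0):
    """Weighted sum of the two objectives (the weighting is an assumption)."""
    return reconstruction + contrast_weight * contrast
```

In this sketch the reconstruction term only sees masked positions, while the contrast term pulls a single encoder feature toward the centroids that carry the most weight.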

Related papers