Fugu-MT Paper Translation (Abstract): Multimodal Distribution Matching for Vision-Language Dataset Distillation

Paper overview: Multimodal Distribution Matching for Vision-Language Dataset Distillation

arxiv url: http://arxiv.org/abs/2605.23482v1
Date: Fri, 22 May 2026 10:41:58 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.318326
Title: Multimodal Distribution Matching for Vision-Language Dataset Distillation
Title (reference translation): Multimodal Distribution Matching for Vision-Language Dataset Distillation
Authors: Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon
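Abstract summary: Multimodal Distribution Matching (MDM) is a geometry-aware framework for efficient and generalizable multimodal distillation. MDM integrates complementary components at the data, model, and loss levels. It yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
Reference score (independently computed attention metric): 50.411341509805936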
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
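Abstract (reference translation): Dataset distillation compresses large training sets into compact synthetic datasets while maintaining downstream performance. Because modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets. To this end, we propose MDM, a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits joint features along the cross-modal agreement and discrepancy directions, together with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM produces compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.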
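
Illustrative note: the abstract mentions symmetric contrastive learning on the unit hypersphere but does not give the loss itself. The sketch below is a generic CLIP-style symmetric contrastive loss written in NumPy, offered only to make that one ingredient concrete. It is not MDM's geometry-aware matching objective, and the function names, the `temperature` value, and the assumption that row i of each array forms a matched image-text pair are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Project each row onto the unit hypersphere.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Assumption: image_emb[i] and text_emb[i] are a matched pair.
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = img @ txt.T / temperature  # cosine similarities, scaled
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should pick out its own column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should pick out its own row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Averaging the two directions makes the loss invariant to swapping the image and text arguments, which is the sense in which it is symmetric. Correctly paired embeddings should score a lower loss than mismatched ones.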

Related papers