Fugu-MT Paper Translation (Abstract): Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

Paper overview: Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

arxiv url: http://arxiv.org/abs/2602.19756v1
Date: Mon, 23 Feb 2026 12:08:28 GMT
Status: Translation complete
Last updated in system: 2026-02-24 17:42:02.799608
Title: Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis
Authors: Junhyeok Choi, Sangwoo Mo, Minwoo Chae
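Abstract summary: This work proposes a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization. The method uses CLIP to extract image-text embeddings, obtains prototypes, and synthesizes images with an unCLIP decoder.
Reference score (attention score computed independently by this site): 8.74674837306488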
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.
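The prototype step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how prototypes are obtained, so everything here is an assumption for illustration: spherical k-means over image embeddings, a text prototype taken as the mean of the paired text embeddings in each cluster, and random vectors standing in for real CLIP embeddings. The function names (`l2_normalize`, `kmeans_prototypes`, `distill_prototypes`) are hypothetical, and the unCLIP decoding step is not implemented.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (CLIP embeddings are compared by cosine)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def kmeans_prototypes(embeddings, k, iters=20, seed=0):
    """Spherical k-means (an assumed choice, not confirmed by the abstract).

    Returns (k, d) unit-norm prototypes and the cluster index of each sample.
    """
    rng = np.random.default_rng(seed)
    x = l2_normalize(embeddings)
    # Initialize centers from k distinct samples.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to the center with the highest cosine similarity.
        assign = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    assign = np.argmax(x @ centers.T, axis=1)
    return centers, assign


def distill_prototypes(img_emb, txt_emb, k, seed=0):
    """Pair each image prototype with the mean text embedding of its cluster."""
    img_protos, assign = kmeans_prototypes(img_emb, k, seed=seed)
    txt = l2_normalize(txt_emb)
    txt_protos = np.stack([
        txt[assign == j].mean(axis=0) if np.any(assign == j)
        else np.zeros(txt.shape[1])
        for j in range(k)
    ])
    return img_protos, l2_normalize(txt_protos), assign


# Random stand-ins for aligned CLIP image/text embeddings (assumption).
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(200, 16))
txt_emb = img_emb + 0.1 * rng.normal(size=(200, 16))

img_protos, txt_protos, assign = distill_prototypes(img_emb, txt_emb, k=5)
print(img_protos.shape, txt_protos.shape)
# In the pipeline the abstract describes, each image prototype would then be
# passed to an unCLIP decoder to synthesize an image; that step is omitted here.
```

In this sketch the distilled set has `k` image-text prototype pairs, so the compression ratio is controlled directly by `k` and no gradient-based optimization is involved, which matches the "learning-free" framing only at a high level.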


Related papers