Fugu-MT Paper Translation (Abstract): SAS: Semantic-aware Sampling for Generative Dataset Distillation

arxiv url: http://arxiv.org/abs/2605.18012v1
Date: Mon, 18 May 2026 08:05:46 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.099996
Title: SAS: Semantic-aware Sampling for Generative Dataset Distillation
Authors: Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama
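Abstract summary: This paper introduces a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. The goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse.
Reference score (independently computed attention metric): 55.27114962330541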
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.
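The two-stage sampling described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formulas, so the cosine-similarity definitions of class relevance and inter-class separability, the way they are summed, the function names, the `keep_ratio` parameter, and the greedy least-similar-first diversity selection are all hypothetical. The embeddings are assumed to be CLIP image and text features computed elsewhere; random vectors stand in for them here.

```python
import numpy as np


def normalize(x):
    """L2-normalize rows so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def semantic_scores(img_emb, txt_emb, label):
    """Hypothetical class-relevance and inter-class-separability scores.

    img_emb: (n, d) normalized image embeddings for one class's image pool.
    txt_emb: (c, d) normalized text embeddings, one per class.
    label:   index of the class the pool belongs to.
    """
    sims = img_emb @ txt_emb.T                       # (n, c) cosine similarities
    relevance = sims[:, label]                       # similarity to the own class
    others = np.delete(sims, label, axis=1)          # similarities to other classes
    separability = relevance - others.max(axis=1)    # margin over closest other class
    return relevance, separability


def two_stage_select(img_emb, txt_emb, label, ipc, keep_ratio=0.5):
    """Stage 1 filters discriminative candidates; stage 2 picks diverse ones.

    ipc is the number of images to keep for this class (assumed name).
    """
    if ipc > len(img_emb):
        raise ValueError("ipc exceeds the pool size")

    # Stage 1: keep the top-scoring fraction of the pool (assumed combination).
    relevance, separability = semantic_scores(img_emb, txt_emb, label)
    score = relevance + separability
    n_keep = max(ipc, int(len(score) * keep_ratio))
    candidates = np.argsort(-score)[:n_keep]
    cand_emb = img_emb[candidates]

    # Stage 2: start from the best candidate, then repeatedly add the
    # candidate whose highest similarity to the selected set is lowest.
    selected = [int(candidates[0])]
    max_sim = cand_emb @ cand_emb[0]
    max_sim[0] = np.inf                              # never re-pick a selected item
    while len(selected) < ipc:
        nxt = int(np.argmin(max_sim))
        selected.append(int(candidates[nxt]))
        max_sim = np.maximum(max_sim, cand_emb @ cand_emb[nxt])
        max_sim[nxt] = np.inf
    return selected


# Stand-in data: 10 classes, a pool of 100 images for class 3, 64-dim features.
rng = np.random.default_rng(0)
txt = normalize(rng.normal(size=(10, 64)))
pool = normalize(rng.normal(size=(100, 64)))
picked = two_stage_select(pool, txt, label=3, ipc=5)
print(picked)
```

Because the diversity term is recomputed after every pick, the second stage adapts to what has already been chosen, which is one plausible reading of the "dynamic diversity-aware selection" the abstract mentions; the paper itself may define it differently.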
