Fugu-MT Paper Translation (Abstract): Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment

Paper summary: Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment

arxiv url: http://arxiv.org/abs/2603.27987v1
Date: Mon, 30 Mar 2026 03:20:27 GMT
Status: translation complete
System update date: 2026-03-31 23:18:45.208398
Title: Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment
Title (reference translation): Beyond Dataset Distillation: Lossless Dataset Concentration via Diffusion-Assisted Distribution Alignment
Authors: Tongfei Liu, Yufan Liu, Bing Li, Weiming Hu
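Abstract summary: The paper proposes a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via "Doping". DsCo is applicable in both data-accessible and data-free scenarios, achieves SOTA performance at low data volumes, and extends well to high data volumes.
Reference score (independently computed attention score): 43.678155518039745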
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The high cost and accessibility problem associated with large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. The existing state-of-the-art diffusion-based dataset distillation methods face three issues: lack of theoretical justification, poor efficiency in scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that justifies the use of diffusion models by proving the equivalence between dataset distillation and distribution matching, and reveals an inherent efficiency limit in the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via "Doping", which mixes selected samples from the original dataset with the synthetic samples to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieving SOTA performances for low data volumes, and it extends well to high data volumes, where it nearly reduces the dataset size by half with no performance degradation.
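Abstract (reference translation): The high cost and accessibility problems associated with large datasets hinder the development of large-scale visual recognition systems. Dataset Distillation addresses these problems by synthesizing compact surrogate datasets for efficient training, storage, transfer, and privacy preservation. Existing state-of-the-art diffusion-based dataset distillation methods face three issues: a lack of theoretical justification, poor efficiency in scaling to high data volumes, and failure in data-free scenarios. To address these issues, we establish a theoretical framework that proves the equivalence between dataset distillation and distribution matching, justifies the use of diffusion models, and reveals an efficiency limit inherent in the dataset distillation paradigm. We then propose a Dataset Concentration (DsCo) framework that uses a diffusion-based Noise-Optimization (NOpt) method to synthesize a small yet representative set of samples, and optionally augments the synthetic data via "Doping", which mixes selected samples from the original dataset with the synthetic samples to overcome the efficiency limit of dataset distillation. DsCo is applicable in both data-accessible and data-free scenarios, achieves SOTA performance at low data volumes, and extends well to high data volumes.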

Related papers