Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation
- URL: http://arxiv.org/abs/2510.04838v1
- Date: Mon, 06 Oct 2025 14:22:28 GMT
- Title: Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation
- Authors: Muquan Li, Hang Gou, Dongyang Zhang, Shuang Liang, Xiurui Xie, Deqiang Ouyang, Ke Qin
- Abstract summary: We propose Automatic Truncated Backpropagation Through Time (AT-BPTT) for dataset distillation. AT-BPTT adapts both truncation positions and window sizes according to intrinsic gradient behavior. Experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance.
- Score: 11.37339433547758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training datasets while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages (early, middle, and late), making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9x while reducing memory cost by 63%.
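The abstract names two scheduling mechanisms for the inner loop: stage-aware timestep selection and window sizing driven by gradient variation. The NumPy sketch below illustrates one plausible reading of those two ideas; the three-stage split, the stage probabilities, and the coefficient-of-variation heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncation_position(num_steps, stage_probs=(0.2, 0.5, 0.3)):
    """Pick the inner-loop step at which truncated BPTT stops.

    The inner loop is split into early / middle / late thirds; a stage is
    drawn according to `stage_probs`, then a step is drawn uniformly
    inside that stage (a hypothetical stage-aware selection rule).
    """
    bounds = [0, num_steps // 3, 2 * num_steps // 3, num_steps]
    stage = rng.choice(3, p=np.asarray(stage_probs) / np.sum(stage_probs))
    return int(rng.integers(bounds[stage], bounds[stage + 1]))

def adapt_window_size(grad_norms, base_window=10, min_window=2, max_window=30):
    """Grow the backpropagation window when recent gradient norms vary
    strongly and shrink it when they are flat -- one possible reading of
    "adaptive window sizing based on gradient variation"."""
    recent = np.asarray(grad_norms[-base_window:])
    if len(recent) < 2:
        return base_window
    variation = recent.std() / (recent.mean() + 1e-8)  # coefficient of variation
    window = int(round(base_window * (1.0 + variation)))
    return int(np.clip(window, min_window, max_window))

# Toy inner loop: simulated per-step gradient norms that decay as training converges.
num_inner_steps = 60
grad_norms = [float(np.exp(-t / 20) * (1 + 0.1 * rng.standard_normal()))
              for t in range(num_inner_steps)]

t_stop = sample_truncation_position(num_inner_steps)
window = adapt_window_size(grad_norms[:t_stop + 1])
print(f"backpropagate through steps [{max(0, t_stop - window + 1)}, {t_stop}]")
```

In a real dataset-distillation setup, the selected window would bound how many unrolled inner-loop steps participate in the meta-gradient computation; everything outside the window is treated as constant.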
Related papers
- Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation [109.13471554184554]
We reformulate dataset distillation as an Optimal Transport (OT) distance minimization problem. OT offers a geometrically faithful framework for distribution matching. Our method consistently outperforms state-of-the-art approaches in an efficient manner.
arXiv Detail & Related papers (2025-11-29T04:04:05Z) - Tri-Accel: Curvature-Aware Precision-Adaptive and Memory-Elastic Optimization for Efficient GPU Usage [0.6511750267058007]
Tri-Accel is a unified optimization framework that co-adapts three acceleration strategies along with adaptive parameters during training. On CIFAR-10 with ResNet-18 and EfficientNet-B0, Tri-Accel achieves up to 9.9% reduction in training time and 13.3% lower memory usage. Compared to static mixed-precision training, Tri-Accel maintains 78.1% accuracy while reducing memory footprint from 0.35GB to 0.31GB on standard hardware.
arXiv Detail & Related papers (2025-08-23T05:38:42Z) - Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields [6.66530903309279]
We propose a spike-based NeRF framework with a dynamic time-step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). We show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.
arXiv Detail & Related papers (2025-07-30T18:56:24Z) - Leveraging Stochastic Depth Training for Adaptive Inference [1.996143466020199]
We propose a simpler yet effective alternative for adaptive inference that is zero-overhead, single-model, and time-predictable. Compared to original ResNets, our method shows improvements of up to 2x in power efficiency with accuracy drops as low as 0.71%.
arXiv Detail & Related papers (2025-05-23T08:36:56Z) - SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer [49.1761733723771]
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. We introduce three key innovations: Efficient Training Scaling, Model Depth Pruning, and Inference-time Scaling. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.81 on GenEval, which can be further improved to 0.96 through inference-time scaling with VILA-Judge.
arXiv Detail & Related papers (2025-01-30T15:31:48Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models [33.911521719528686]
Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usage.
A promising approach is using Zeroth-Order (ZO) gradients, which are estimated from function values alone to replace First-Order (FO) gradients (a generic two-point ZO estimate is sketched after this list).
We introduce a novel layer-wise sparse computation- and memory-efficient ZO optimizer, named LeZO.
arXiv Detail & Related papers (2024-10-13T12:47:37Z) - Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z) - Efficient NLP Model Finetuning via Multistage Data Filtering [11.058786955754004]
We set out to filter training examples in a streaming fashion, in tandem with training the target model.
Our key techniques are (1) automatically determining a training loss threshold for skipping backward training passes, and (2) running a meta predictor for further skipping forward training passes.
Our method reduces the required training examples by up to 5.3x and training time by up to 6.8x, while only seeing minor accuracy degradation.
arXiv Detail & Related papers (2022-07-28T21:43:31Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning quadratics.
We prove that our model-based procedure converges in the noisy gradient setting.
This is an interesting step toward constructing self-tuning quadratics.
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - SASL: Saliency-Adaptive Sparsity Learning for Neural Network Acceleration [20.92912642901645]
We propose a Saliency-Adaptive Sparsity Learning (SASL) approach for further optimization.
Our method can reduce 49.7% of the FLOPs of ResNet-50 with negligible 0.39% top-1 and 0.05% top-5 accuracy degradation.
arXiv Detail & Related papers (2020-03-12T16:49:37Z)
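For reference on the LeZO entry above: zeroth-order methods estimate gradients from function values alone. The sketch below shows a generic two-point (symmetric-difference) ZO estimator on a toy quadratic; the perturbation scale, the number of random directions, and the objective are assumptions for illustration and do not reproduce LeZO's layer-wise sparse scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, theta, mu=1e-3, num_directions=16):
    """Estimate grad f(theta) from function values only.

    For each random direction u, the symmetric difference
    (f(theta + mu*u) - f(theta - mu*u)) / (2*mu) approximates the
    directional derivative along u, and u times that value approximates
    the gradient; averaging over directions reduces the variance.
    """
    grad = np.zeros_like(theta)
    for _ in range(num_directions):
        u = rng.standard_normal(theta.shape)
        diff = (f(theta + mu * u) - f(theta - mu * u)) / (2.0 * mu)
        grad += diff * u
    return grad / num_directions

# Toy quadratic objective with minimum at theta = [1, -2, 3].
target = np.array([1.0, -2.0, 3.0])
f = lambda th: float(np.sum((th - target) ** 2))

theta = np.zeros(3)
for step in range(200):
    theta -= 0.05 * zo_gradient(f, theta)  # plain gradient descent on the ZO estimate
print("estimated minimizer:", np.round(theta, 2))
```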