DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
- URL: http://arxiv.org/abs/2506.14202v2
- Date: Fri, 03 Oct 2025 08:12:25 GMT
- Title: DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation
- Authors: Makoto Shing, Masanori Koyama, Takuya Akiba
- Abstract summary: DiffusionBlocks is a principled framework for transforming transformer-based networks into genuinely independent trainable blocks. Our experiments on a range of transformer architectures demonstrate that DiffusionBlocks training matches the performance of end-to-end training.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $\textit{DiffusionBlocks}$, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures.
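The core mechanism in the abstract (a residual update reinterpreted as one denoising step, trained per block with a local score-matching-style objective) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the linear stand-in "blocks", the noise schedule, and the squared-error denoising loss are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_blocks = 8, 4

# Stand-ins for transformer blocks: x -> x + W @ x (a residual update).
blocks = [rng.normal(scale=0.01, size=(dim, dim)) for _ in range(n_blocks)]
init_blocks = [W.copy() for W in blocks]
noise_levels = np.linspace(0.8, 0.1, n_blocks)  # assumed per-block schedule

def local_denoise_loss(W, x_clean, sigma):
    """Local objective for ONE block: its residual update should map a
    noised input back toward the clean signal (a simplified stand-in for
    the score-matching objective described in the abstract)."""
    x_noisy = x_clean + sigma * rng.normal(size=x_clean.shape)
    err = (x_noisy + W @ x_noisy) - x_clean
    loss = float(np.mean(err ** 2))
    grad = 2.0 * np.outer(err, x_noisy) / err.size
    return loss, grad

lr = 0.1
for step in range(400):
    x = rng.normal(size=dim)     # a training example
    k = step % n_blocks          # gradients are held for ONE block per step,
    _, grad = local_denoise_loss(blocks[k], x, noise_levels[k])
    blocks[k] -= lr * grad       # so activation memory scales like 1/n_blocks

def avg_loss(W, sigma, n=200):
    """Average denoising loss over fresh samples, for evaluation."""
    return float(np.mean([local_denoise_loss(W, rng.normal(size=dim), sigma)[0]
                          for _ in range(n)]))
```

Because each block's loss depends only on its own parameters, the blocks could equally well be trained in parallel on separate devices, which is the memory and scalability argument the abstract makes.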
Related papers
- From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs [58.640039233470766]
We show that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities, and achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-07T10:28:21Z)
- Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model [53.77953728335891]
Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network. We propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion.
arXiv Detail & Related papers (2025-11-18T17:58:16Z)
- Scalable Forward-Forward Algorithm [1.9580473532948401]
We propose a scalable Forward-Forward (FF) algorithm that eliminates the need for backpropagation by training each layer separately. We extend FF to modern convolutional architectures, such as MobileNetV3 and ResNet18, by introducing a new way to compute losses for convolutional layers.
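As a reminder of what layer-local FF training looks like, here is a minimal sketch of Hinton's Forward-Forward update for one fully connected layer. The toy data, threshold, and learning rate are assumptions, and the paper's convolutional loss is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, hidden = 6, 8
v = np.ones(dim) / np.sqrt(dim)      # toy "real data" direction

def sample(positive):
    if positive:
        return 2.0 * v + 0.1 * rng.normal(size=dim)   # structured data
    return rng.normal(size=dim)                       # negative data = noise

def goodness(W, x):
    h = np.maximum(W @ x, 0.0)       # this layer's ReLU activations
    return float(np.sum(h ** 2)), h  # FF "goodness": sum of squared activity

def ff_step(W, x, positive, theta=2.0, lr=0.03):
    """One Forward-Forward update for a single layer: push goodness above
    theta on positive data and below theta on negative data.  The loss is
    purely local -- no backpropagation through other layers."""
    g, h = goodness(W, x)
    sign = 1.0 if positive else -1.0
    p = 1.0 / (1.0 + np.exp(-sign * (g - theta)))     # prob. of correct side
    # gradient of -log p w.r.t. W, via dg/dW = 2 * outer(h, x)
    return W + lr * sign * (1.0 - p) * 2.0 * np.outer(h, x)

W = rng.normal(scale=0.1, size=(hidden, dim))
for step in range(300):
    pos = step % 2 == 0
    W = ff_step(W, sample(pos), pos)
```

After training, the layer's goodness separates positive from negative inputs, which is all a downstream FF layer needs from it.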
arXiv Detail & Related papers (2025-01-06T17:49:00Z)
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
ACDiT is a blockwise Conditional Diffusion Transformer. It offers a flexible interpolation between token-wise autoregression and full-sequence diffusion. We show that ACDiT performs best among all autoregressive baselines on image and video generation tasks.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- Towards Universal Dense Blocking for Entity Resolution [49.06313308481536]
We propose UniBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable corpus.
By conducting domain-independent pre-training, UniBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning.
Our experiments show that the proposed UniBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods.
arXiv Detail & Related papers (2024-04-23T08:39:29Z)
- BEND: Bagging Deep Learning Training Based on Efficient Neural Network Diffusion [56.9358325168226]
We propose a Bagging deep learning training algorithm based on Efficient Neural network Diffusion (BEND).
Our approach is simple but effective: we first use multiple trained model weights and biases as inputs to train an autoencoder and a latent diffusion model.
Our proposed BEND algorithm can consistently outperform the mean and median accuracies of both the original trained model and the diffused model.
arXiv Detail & Related papers (2024-03-23T08:40:38Z)
- Unlocking Deep Learning: A BP-Free Approach for Parallel Block-Wise Training of Neural Networks [9.718519843862937]
We introduce a block-wise BP-free (BWBPF) neural network that leverages local error signals to optimize sub-neural networks separately.
Our experimental results consistently show that this approach can identify transferable decoupled architectures for VGG and ResNet variations.
arXiv Detail & Related papers (2023-12-20T08:02:33Z)
- An NMF-Based Building Block for Interpretable Neural Networks With Continual Learning [0.8158530638728501]
Existing learning methods often struggle to balance interpretability and predictive performance.
Our approach aims to strike a better balance between these two aspects through the use of a building block based on NMF.
arXiv Detail & Related papers (2023-11-20T02:00:33Z)
- Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search [55.41583104734349]
We propose to automatically remove structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS).
Given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which can achieve on-par or even better performance than the teacher.
Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss.
arXiv Detail & Related papers (2023-11-08T12:56:59Z)
- Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation [49.44309457870649]
Layer-wise Feedback Propagation (LFP) is a novel training principle for neural network-like predictors. LFP decomposes a reward to individual neurons based on their respective contributions. Our method then implements a greedy approach, reinforcing helpful parts of the network and weakening harmful ones.
arXiv Detail & Related papers (2023-08-23T10:48:28Z)
- Learning Discrete Weights and Activations Using the Local Reparameterization Trick [21.563618480463067]
In computer vision and machine learning, a crucial challenge is to lower the computation and memory demands for neural network inference.
By binarizing the network weights and activations, one can significantly reduce computational complexity.
This leads to a more efficient neural network inference that can be deployed on low-resource devices.
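For intuition on why binarization cuts cost, here is a generic XNOR-Net-style sign quantization with a per-row scale. This is a common baseline sketch, not the paper's local-reparameterization method; the shapes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def binarize(W):
    """Sign-binarize W with a per-row scale alpha chosen to minimize
    ||W - alpha * sign(W)||; the closed-form optimum is alpha = mean |W|."""
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)
    return alpha * np.sign(W)

W = rng.normal(size=(4, 8))
Wb = binarize(W)        # each row now holds only two values, +alpha and -alpha
x = rng.normal(size=8)
y_approx = Wb @ x       # approximates W @ x using only additions/subtractions
```

Because every row of `Wb` is a single scale times a sign pattern, the matrix-vector product reduces to sign flips and one multiply per row, which is what XNOR/popcount kernels exploit on low-resource hardware.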
arXiv Detail & Related papers (2023-07-04T12:27:10Z)
- Block-local learning with probabilistic latent representations [2.839567756494814]
Locking and weight transport are problems because they prevent efficient parallelization and horizontal scaling of the training process.
We propose a new method to address both these problems and scale up the training of large models.
We present results on a variety of tasks and architectures, demonstrating state-of-the-art performance using block-local learning.
arXiv Detail & Related papers (2023-05-24T10:11:30Z)
- The Cascaded Forward Algorithm for Neural Network Training [61.06444586991505]
We propose a new learning framework for neural networks, the Cascaded Forward (CaFo) algorithm, which, like FF, does not rely on BP optimization.
Unlike FF, our framework directly outputs label distributions at each cascaded block, which does not require generation of additional negative samples.
In our framework each block can be trained independently, so it can be easily deployed into parallel acceleration systems.
arXiv Detail & Related papers (2023-03-17T02:01:11Z)
- Latent Iterative Refinement for Modular Source Separation [44.78689915209527]
Traditional source separation approaches train deep neural network models end-to-end with all the data available at once.
We argue that we can significantly increase resource efficiency during both training and inference stages.
arXiv Detail & Related papers (2022-11-22T00:02:57Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Block-wise Training of Residual Networks via the Minimizing Movement Scheme [10.342408668490975]
We develop a layer-wise training method, particularly well suited to ResNets, inspired by the minimizing movement scheme for gradient flows in distribution space.
The method amounts to a kinetic energy regularization of each block that makes the blocks optimal transport maps and endows them with regularity.
It works by alleviating the stagnation problem observed in layer-wise training, whereby greedily-trained early layers overfit and deeper layers stop increasing test accuracy after a certain depth.
arXiv Detail & Related papers (2022-10-03T14:03:56Z)
- FFNB: Forgetting-Free Neural Blocks for Deep Continual Visual Learning [14.924672048447338]
We devise a dynamic network architecture for continual learning based on a novel forgetting-free neural block (FFNB).
Training FFNB features on new tasks is achieved using a novel procedure that constrains the underlying parameters in the null-space of the previous tasks.
arXiv Detail & Related papers (2021-11-22T17:23:34Z)
- BLOOM-Net: Blockwise Optimization for Masking Networks Toward Scalable and Efficient Speech Enhancement [26.39206098000297]
We present a blockwise optimization method for masking-based networks (BLOOM-Net) for training scalable speech enhancement networks.
Our experiments on speech enhancement demonstrate that the proposed blockwise optimization method achieves the desired scalability with only a slight performance degradation compared to corresponding models trained end-to-end.
arXiv Detail & Related papers (2021-11-17T20:11:07Z)
- All at Once Network Quantization via Collaborative Knowledge Transfer [56.95849086170461]
We develop a novel collaborative knowledge transfer approach for efficiently training the all-at-once quantization network.
Specifically, we propose an adaptive selection strategy to choose a high-precision "teacher" for transferring knowledge to the low-precision student.
To effectively transfer knowledge, we develop a dynamic block swapping method by randomly replacing the blocks in the lower-precision student network with the corresponding blocks in the higher-precision teacher network.
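The dynamic block swapping described above can be sketched directly; the block names and swap probability here are illustrative assumptions, with strings standing in for network blocks.

```python
import random

def swap_blocks(student_blocks, teacher_blocks, rng, p=0.5):
    """For one training step, randomly replace each low-precision student
    block with the corresponding high-precision teacher block, so the
    remaining student blocks are trained against teacher-quality context."""
    return [t if rng.random() < p else s
            for s, t in zip(student_blocks, teacher_blocks)]

student = ["s0", "s1", "s2", "s3"]   # placeholder low-precision blocks
teacher = ["t0", "t1", "t2", "t3"]   # placeholder high-precision blocks
mixed = swap_blocks(student, teacher, random.Random(0))  # one mixed path
```

Resampling the swap mask every step means each student block sees many different teacher/student contexts during training.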
arXiv Detail & Related papers (2021-03-02T03:09:03Z)
- Attentive Gaussian processes for probabilistic time-series generation [4.94950858749529]
We propose a computationally efficient attention-based network combined with Gaussian process regression to generate real-valued sequences.
We develop a block-wise training algorithm to allow mini-batch training of the network while the GP is trained using full-batch.
The algorithm is proven to converge and finds solutions of comparable, if not better, quality.
arXiv Detail & Related papers (2021-02-10T01:19:15Z)
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product Belief Propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.