FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
- URL: http://arxiv.org/abs/2409.19289v1
- Date: Sat, 28 Sep 2024 08:57:17 GMT
- Title: FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
- Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Xin Geng
- Abstract summary: FINE is a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models.
It decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as ``learngenes'', and $\Sigma$ remains layer-specific.
It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
- Score: 35.40065954148091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models often face slow convergence, and existing efficient training techniques, such as Parameter-Efficient Fine-Tuning (PEFT), are primarily designed for fine-tuning pre-trained models. However, these methods are limited in adapting models to variable sizes for real-world deployment, where no corresponding pre-trained models exist. To address this, we introduce FINE, a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models, while considering both model sizes and task-specific requirements. FINE decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as ``learngenes'', and $\Sigma$ remains layer-specific. During initialization, FINE trains only $\Sigma$ using a small subset of data, while keeping the learngene parameters fixed, making it the first approach to integrate both size and task considerations in initialization. We provide a comprehensive benchmark for learngene-based methods in image generation tasks, and extensive experiments demonstrate that FINE consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes. FINE also offers significant computational and storage savings, reducing training steps by approximately $3N\times$ and storage by $5\times$, where $N$ is the number of models. Additionally, FINE's adaptability to tasks yields an average performance improvement of 4.29 and 3.30 in FID and sFID across multiple downstream datasets, highlighting its versatility and efficiency.
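The factorization lends itself to a short sketch. The snippet below reconstructs a layer weight as $U \Sigma V^{\top}$, keeps the shared $U$ and $V$ frozen, and exposes only the layer-specific $\Sigma$ as trainable; the class name, shapes, and single-layer setup are illustrative assumptions rather than the authors' code.
```python
# Minimal sketch of the U-Sigma-V factorized initialization described in the
# abstract (illustrative; not the authors' implementation). The shared U and V
# play the role of the frozen ``learngenes''; only sigma is trained.
import torch
import torch.nn as nn


class FactorizedLinear(nn.Module):
    def __init__(self, U: torch.Tensor, V: torch.Tensor, init_sigma: torch.Tensor):
        super().__init__()
        self.register_buffer("U", U)                    # (out_features, r), frozen
        self.register_buffer("V", V)                    # (in_features, r), frozen
        self.sigma = nn.Parameter(init_sigma.clone())   # (r,), trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.U @ torch.diag(self.sigma) @ self.V.T   # W = U Sigma V^T
        return x @ weight.T


# Derive U, Sigma, V from a pre-trained weight via SVD (random stand-in here);
# in FINE, U and V would be shared across all blocks of the network.
pretrained_weight = torch.randn(256, 256)
U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)

layer = FactorizedLinear(U, Vh.T, init_sigma=S)
out = layer(torch.randn(4, 256))   # only layer.sigma receives gradients
print(out.shape)                   # torch.Size([4, 256])
```
In FINE itself, $U$ and $V$ are shared across network blocks while one $\Sigma$ per layer is trained on a small data subset, which is where the size and task adaptability comes from.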
Related papers
- KIND: Knowledge Integration and Diversion in Diffusion Models [40.442303050947395]
We introduce KIND, which performs Knowledge INtegration and Diversion in diffusion models.
KIND redefines traditional pre-training methods by adjusting training objectives from maximizing model performance on current tasks to condensing transferable common knowledge.
Results indicate that KIND achieves state-of-the-art performance compared to other PEFT and learngene methods.
arXiv Detail & Related papers (2024-08-14T07:22:28Z) - Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models [40.21274215353816]
We introduce the Learngene framework, which learns one compact part, termed the learngene, from a large well-trained model.
We then expand these learngene layers, which contain stage information, at their corresponding stages to initialize models of variable depths.
Experiments on ImageNet-1K demonstrate that SWS achieves consistent better performance compared to many models trained from scratch.
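A minimal sketch of this stage-wise expansion idea, assuming one learngene layer per stage (the layer type, dimensions, and helper are illustrative, not from the paper):
```python
# Illustrative sketch: repeat each stage's "learngene" layer to reach a target
# depth, then use the result to initialize networks of variable depth.
import copy
import torch.nn as nn

# One learngene layer per stage, e.g. extracted from a large well-trained model.
learngene_stages = [nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
                    for _ in range(3)]

def build_variable_depth(stage_depths):
    """Repeat each stage's learngene layer stage_depths[i] times."""
    layers = []
    for stage_layer, depth in zip(learngene_stages, stage_depths):
        layers.extend(copy.deepcopy(stage_layer) for _ in range(depth))
    return nn.Sequential(*layers)

small_model = build_variable_depth([2, 2, 2])   # 6-layer initialization
large_model = build_variable_depth([4, 4, 4])   # 12-layer initialization
```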
arXiv Detail & Related papers (2024-04-25T06:04:34Z) - Fisher Mask Nodes for Language Model Merging [0.0]
We introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning.
Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging in a fraction of the computational cost.
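For context, a hedged sketch of the diagonal Fisher-weighted averaging this method builds on (generic parameter merging, not the paper's node-masking procedure; all names are illustrative):
```python
import torch

def fisher_weighted_average(state_dicts, fishers, eps=1e-8):
    """Merge models parameter-wise, weighting each parameter by its diagonal Fisher estimate."""
    merged = {}
    for name in state_dicts[0]:
        num = sum(f[name] * sd[name] for sd, f in zip(state_dicts, fishers))
        den = sum(f[name] for f in fishers) + eps
        merged[name] = num / den
    return merged

# Toy usage: fishers[i][name] would hold accumulated squared gradients of the
# log-likelihood w.r.t. parameter `name`, estimated on a small validation set.
models = [{"w": torch.ones(2)}, {"w": torch.zeros(2)}]
fishers = [{"w": torch.full((2,), 3.0)}, {"w": torch.full((2,), 1.0)}]
print(fisher_weighted_average(models, fishers)["w"])  # approx. tensor([0.75, 0.75])
```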
arXiv Detail & Related papers (2024-03-14T21:52:26Z) - Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z) - Building Variable-sized Models via Learngene Pool [39.99697115082106]
Recently, Stitchable Neural Networks (SN-Net) were proposed to stitch pre-trained networks together, building numerous networks with different complexity and performance trade-offs.
However, SN-Net faces challenges in building smaller models under low resource constraints.
We propose a novel method called Learngene Pool to overcome these challenges.
arXiv Detail & Related papers (2023-12-10T03:46:01Z) - $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained
Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
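A simplified sketch of the general idea of storing only a parameter delta rather than a full model copy (illustrative only; not the paper's $\Delta$-Networks design):
```python
import torch

def extract_delta(base_state, tuned_state):
    """Keep only the parameter differences introduced by fine-tuning."""
    return {k: tuned_state[k] - base_state[k] for k in base_state}

def apply_delta(base_state, delta):
    """Reconstruct fine-tuned weights from the base model plus a stored delta."""
    return {k: v + delta[k] for k, v in base_state.items()}

base = {"w": torch.zeros(3)}
tuned = {"w": torch.ones(3)}
delta = extract_delta(base, tuned)   # only this (typically small) dict is stored
restored = apply_delta(base, delta)  # equals `tuned`
```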
arXiv Detail & Related papers (2023-03-26T16:39:44Z) - SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language
Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
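As an illustration of unstructured weight sparsity during pre-training (a generic masking sketch under assumed shapes, not the SPDF training recipe):
```python
import torch
import torch.nn as nn

def sparsify(layer: nn.Linear, sparsity: float = 0.75) -> nn.Linear:
    """Apply a fixed unstructured mask so only a subset of weights is trained."""
    mask = (torch.rand_like(layer.weight) > sparsity).float()
    with torch.no_grad():
        layer.weight.mul_(mask)                            # zero out masked weights
    layer.weight.register_hook(lambda grad: grad * mask)   # keep them zero during training
    return layer

sparse_layer = sparsify(nn.Linear(1024, 1024), sparsity=0.75)
# Dense fine-tuning would simply skip the masking and update all weights.
```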
arXiv Detail & Related papers (2023-03-18T17:56:01Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
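A hedged sketch of the upcycling step: each expert of a new MoE layer starts as a copy of the dense checkpoint's MLP while the router is trained from scratch (module shapes and helper names are assumptions for illustration, not the paper's T5/ViT code):
```python
import copy
import torch.nn as nn

def upcycle_mlp(dense_mlp: nn.Sequential, num_experts: int = 8):
    """Initialize each expert as a copy of the dense MLP; the router starts from scratch."""
    experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
    router = nn.Linear(dense_mlp[0].in_features, num_experts)
    return experts, router

dense_mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
experts, router = upcycle_mlp(dense_mlp)   # replaces one dense MLP block in the upcycled model
```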
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing pre-trained models of roughly half their size.
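The sketch below shows a much-simplified version of reusing a smaller pre-trained model to initialize a larger one, copying each tensor into the matching slice of the larger tensor; bert2BERT's actual function-preserving and advanced knowledge initialization operators are more involved.
```python
import torch

def copy_into_larger(small_state, large_state):
    """Copy each small tensor into the top-left slice of the matching large tensor."""
    for name, w_small in small_state.items():
        if name in large_state and large_state[name].dim() == w_small.dim():
            slices = tuple(slice(0, s) for s in w_small.shape)
            with torch.no_grad():
                large_state[name][slices].copy_(w_small)
    return large_state

small = {"weight": torch.ones(4, 4)}
large = {"weight": torch.zeros(8, 8)}
copy_into_larger(small, large)   # upper-left 4x4 block of large["weight"] is now ones
```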
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each of these tasks across four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z)