Towards Precise Scaling Laws for Video Diffusion Transformers
- URL: http://arxiv.org/abs/2411.17470v2
- Date: Tue, 31 Dec 2024 16:25:39 GMT
- Title: Towards Precise Scaling Laws for Video Diffusion Transformers
- Authors: Yuanyang Yin, Yaqi Zhao, Mingwu Zheng, Ke Lin, Jiarong Ou, Rui Chen, Victor Shea-Jay Huang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang, Kun Gai
- Abstract summary: We systematically analyze scaling laws for video diffusion transformers and propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget.
Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods.
- Score: 43.6690970187664
- License:
- Abstract: Achieving optimal performance of video diffusion transformers within a given data and compute budget is crucial due to their high training costs. This necessitates precisely determining the optimal model size and training hyperparameters before large-scale training. While scaling laws are employed in language models to predict performance, their existence and accurate derivation in visual generation models remain underexplored. In this paper, we systematically analyze scaling laws for video diffusion transformers and confirm their presence. Moreover, we discover that, unlike language models, video diffusion models are more sensitive to learning rate and batch size, two hyperparameters often not precisely modeled. To address this, we propose a new scaling law that predicts optimal hyperparameters for any model size and compute budget. Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods, within a compute budget of 1e10 TFlops. Furthermore, we establish a more generalized and precise relationship among validation loss, any model size, and compute budget. This enables performance prediction for non-optimal model sizes, which may also be appealing under practical inference cost constraints, achieving a better trade-off.
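The abstract describes two fitted relationships: optimal training hyperparameters (learning rate, batch size) as a function of model size and compute budget, and validation loss as a joint function of model size and compute. The sketch below only illustrates the general fit-then-extrapolate workflow for such power laws from small pilot runs; the functional forms, the pilot data, and every coefficient are illustrative assumptions, not the paper's reported fits.

```python
# Hedged sketch: the kind of power-law fits the abstract describes, namely
# (i) an optimal hyperparameter as a function of compute budget and
# (ii) validation loss as a joint function of model size and compute.
# Every functional form and number below is an illustrative assumption.
import numpy as np

# (i) Assume the optimal learning rate follows a power law of compute C:
#     lr_opt(C) = a * C**b.  Fit a, b in log-log space from pilot runs.
C_obs  = np.array([1e6, 1e7, 1e8, 1e9])          # pilot compute budgets (TFlops)
lr_obs = np.array([3e-4, 1.8e-4, 1.1e-4, 6e-5])  # hypothetical best learning rates
b, log_a = np.polyfit(np.log(C_obs), np.log(lr_obs), 1)
a = np.exp(log_a)
print(f"predicted optimal lr at 1e10 TFlops: {a * 1e10**b:.2e}")

# (ii) Assume a Chinchilla-style surface for validation loss given model
#      size N (parameters) and compute C:  L(N, C) = E + A/N**alpha + B/C**beta.
def predicted_loss(N, C, E=1.7, A=4e2, alpha=0.35, B=2e3, beta=0.30):
    """Toy loss surface; all coefficients are made-up placeholders."""
    return E + A / N**alpha + B / C**beta

# Example: compare a non-optimal (smaller, cheaper-to-serve) model against a
# larger one at the same compute budget, the trade-off the abstract mentions.
print(predicted_loss(N=1e9, C=1e10), predicted_loss(N=3e9, C=1e10))
```

In practice the pilot measurements and the exact parametric forms would come from the paper's experiments; the sketch only shows how small-scale fits can be extrapolated to a target budget.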
Related papers
- ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model [27.532993606576152]
We introduce a scalable motion generation framework that includes the motion tokenizer MotionQ-VAE and a text FS-VAE transformer.
For the first time, we confirm the existence of scaling laws within the context of motion generation.
We predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of 1e18.
arXiv Detail & Related papers (2024-12-19T06:22:19Z)
- Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
The Performance Law for sequential recommendation (SR) models aims to theoretically investigate and model the relationship between model performance and data quality.
We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
arXiv Detail & Related papers (2024-11-30T10:56:30Z)
- Scaling Laws For Diffusion Transformers [27.180452052901146]
Diffusion transformers (DiT) have achieved appealing synthesis and scaling properties in content recreation.
However, the scaling laws of DiT, which usually offer precise predictions regarding optimal model size and data requirements, are less explored.
Experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs, are conducted to confirm the existence of scaling laws in DiT.
arXiv Detail & Related papers (2024-10-10T17:56:03Z)
- Optimization Hyper-parameter Laws for Large Language Models [52.49860340549727]
We present Opt-Laws, a framework that captures the relationship between hyper-parameters and training outcomes.
Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss.
This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z)
- More Compute Is What You Need [3.184416958830696]
We propose a new scaling law suggesting that, for transformer-based models, performance depends mostly on the total amount of compute spent.
We predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
arXiv Detail & Related papers (2024-04-30T12:05:48Z)
- Scaling Laws for Fine-Grained Mixture of Experts [4.412803924115907]
Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models.
In this work, we analyze their scaling properties, incorporating an expanded range of variables.
We establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity.
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z)