ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
- URL: http://arxiv.org/abs/2412.14559v1
- Date: Thu, 19 Dec 2024 06:22:19 GMT
- Title: ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
- Authors: Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, Ruimao Zhang,
- Abstract summary: We introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer.
For the first time, we confirm the existence of scaling laws within the context of motion generation.
We predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of $1e18$.
- Score: 27.532993606576152
- Abstract: The scaling law has been validated in various domains, such as natural language processing (NLP) and massive computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system. For the first time, we confirm the existence of scaling laws within the context of motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models adheres to a logarithmic law in relation to compute budgets. Furthermore, we also confirm power laws relating Non-Vocabulary Parameters, Vocabulary Parameters, and Data Tokens, respectively, to compute budgets. Leveraging these scaling laws, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of $1e18$. The test loss of the system, when trained with the optimal model size, vocabulary size, and required data, aligns precisely with the predicted test loss, thereby validating the scaling law.
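As a concrete illustration of the fitting procedure the abstract alludes to, the sketch below fits a power law between an optimal quantity (here, non-vocabulary parameters) and the compute budget in log-log space and extrapolates it to the $1e18$ budget. All data points and fitted values are hypothetical placeholders, not ScaMo's measurements; only the functional forms (a logarithmic law for normalized test loss, power laws for parameters and tokens versus compute) come from the abstract.

```python
import numpy as np

# Hypothetical (compute, optimal non-vocabulary parameter count) pairs.
# These numbers are placeholders for illustration, NOT the paper's data.
C = np.array([1e15, 3e15, 1e16, 3e16, 1e17])            # compute budgets (FLOPs)
N_opt = np.array([2.0e7, 3.6e7, 6.8e7, 1.2e8, 2.3e8])   # optimal parameter counts

# A power law N_opt = a * C^b is linear in log-log space:
#   log N_opt = log a + b * log C,
# so an ordinary least-squares line fit recovers the exponent b and prefactor a.
b, log_a = np.polyfit(np.log(C), np.log(N_opt), 1)
a = np.exp(log_a)

# Extrapolate to the compute budget used in the abstract, C = 1e18.
C_target = 1e18
N_pred = a * C_target ** b
print(f"fitted exponent b = {b:.3f}")
print(f"predicted optimal non-vocabulary parameters at C=1e18: {N_pred:.3e}")

# The logarithmic law for normalized test loss, L(C) ~ alpha - beta * log(C),
# can be fitted the same way: np.polyfit(np.log(C), L, 1).
```

The same recipe applies unchanged to vocabulary parameters and data tokens; only the measured points differ.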
Related papers
- Towards Precise Scaling Laws for Video Diffusion Transformers [43.6690970187664]
We analyze scaling laws for video diffusion transformers and propose a new scaling law for any model size and compute budget.
Under these optimal settings, we achieve comparable performance and reduce inference costs by 40.1% compared to conventional scaling methods.
arXiv Detail & Related papers (2024-11-25T18:59:04Z)
- Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data [4.481230230086981]
In deep neural networks, a model's generalization error is often observed to follow a power scaling law dependent on both the model size and the data size.
We show that our theory predicts a power law between the generalization error and both the training data size and the network size for transformers.
By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry.
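Purely as an illustration of the kind of statement summarized above, a power-law bound on generalization error in the network size $N$ and sample size $n$ is typically written in the following generic form; the exponents and constants are placeholders, not the bounds proved in that paper.

```latex
% Generic power-law scaling of generalization error with network size N and
% training-set size n; C_1, C_2, \alpha, \beta are illustrative placeholders.
\mathbb{E}\left[\mathcal{L}(\hat{f}_{N,n})\right]
  \;\lesssim\; C_1\, N^{-\alpha} + C_2\, n^{-\beta},
\qquad \alpha,\ \beta > 0.
```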
arXiv Detail & Related papers (2024-11-11T01:05:28Z)
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Scaling Laws For Diffusion Transformers [27.180452052901146]
Diffusion transformers (DiT) have achieved appealing synthesis and scaling properties in content recreation.
However, scaling laws of DiT are less explored; such laws usually offer precise predictions regarding optimal model size and data requirements.
Experiments across a broad range of compute budgets, from 1e17 to 6e18 FLOPs, are conducted to confirm the existence of scaling laws in DiT.
arXiv Detail & Related papers (2024-10-10T17:56:03Z)
- Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
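For intuition, one simple way to write a fine-tuning loss curve with both a nearly flat "pre-power phase" at small data scales and an ordinary "power phase" at large scales is sketched below; this is an illustrative parameterization, not necessarily the exact rectified scaling law proposed in that paper.

```latex
% Illustrative two-phase curve in the fine-tuning data size D:
% for D << D_0 the loss is nearly flat (pre-power phase); for D >> D_0 it
% reduces to the familiar power law B * D^{-beta} + E (power phase).
L(D) \;=\; \frac{B}{(D_0 + D)^{\beta}} + E,
\qquad B,\ D_0,\ \beta > 0.
```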
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws Beyond Backpropagation [64.0476282000118]
We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
arXiv Detail & Related papers (2022-10-26T10:09:14Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Scaling Laws for Autoregressive Generative Modeling [30.051804305320424]
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving.
In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law.
arXiv Detail & Related papers (2020-10-28T02:17:24Z)
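The "power-law plus constant" form referenced in this last entry is conventionally written as below, with an irreducible loss term plus a power-law term in the scaled resource $x$ (model size, data, or compute); the symbols are left generic rather than quoting that paper's fitted values.

```latex
% Power-law-plus-constant scaling of test loss with a resource x
% (parameters, data, or compute); L_\infty is the irreducible (constant) loss.
L(x) \;=\; L_{\infty} + \left(\frac{x_0}{x}\right)^{\alpha_x}
```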
This list is automatically generated from the titles and abstracts of the papers on this site.