Scaling Laws for Generative Mixed-Modal Language Models
- URL: http://arxiv.org/abs/2301.03728v1
- Date: Tue, 10 Jan 2023 00:20:06 GMT
- Title: Scaling Laws for Generative Mixed-Modal Language Models
- Authors: Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen
Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke
Zettlemoyer
- Abstract summary: We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them.
Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws.
We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities.
- Score: 103.25737824352949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative language models define distributions over sequences of tokens that
can represent essentially any combination of data modalities (e.g., any
permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens
for language or code, and so on). To better understand the scaling properties
of such mixed-modal models, we conducted over 250 experiments using seven
different modalities and model sizes ranging from 8 million to 30 billion,
trained on 5-100 billion tokens. We report new mixed-modal scaling laws that
unify the contributions of individual modalities and the interactions between
them. Specifically, we explicitly model the optimal synergy and competition due
to data and model size as an additive term to previous uni-modal scaling laws.
We also find four empirical phenomena observed during the training, such as
emergent coordinate-ascent style training that naturally alternates between
modalities, guidelines for selecting critical hyper-parameters, and connections
between mixed-modal competition and training stability. Finally, we test our
scaling law by training a 30B speech-text model, which significantly
outperforms the corresponding unimodal models. Overall, our research provides
valuable insights into the design and training of mixed-modal generative
models, an important new class of unified models that have unique
distributional properties.
Related papers
- No Need to Talk: Asynchronous Mixture of Language Models [25.3581396758015]
SmallTalk LM is an innovative method for training a mixture of language models in an almost asynchronous manner.
We show that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost.
arXiv Detail & Related papers (2024-10-04T15:50:10Z)
- Explore the Limits of Omni-modal Pretraining at Scale [21.82148059125346]
We propose a scalable pretraining paradigm, named Multimodal Context (MiCo).
MiCo can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process.
Our models establish 37 new records for state-of-the-art performance.
arXiv Detail & Related papers (2024-06-13T17:59:53Z)
- DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling [51.055580277828]
We propose DynaMo, a suite of multi-token prediction language models that reduce net inference times.
Our models $\textit{dynamically}$ predict multiple tokens based on their confidence in the predicted joint probability distribution.
We also propose novel ways to enhance the estimated joint probability to improve text generation quality.
arXiv Detail & Related papers (2024-05-01T22:17:57Z)
- Collaborative decoding of critical tokens for boosting factuality of large language models [57.504894664689]
Finetuned and aligned models show improved abilities of instruction following and safe generation.
The common practice of using sampling during generation also increases chances of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- OCHADAI-KYODAI at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction [8.066349353140819]
We propose an ensemble model for predicting the lexical complexity of words and multiword expressions.
The model receives as input a sentence with a target word or MWE and outputs its complexity score.
Our model achieved competitive results and ranked among the top-10 systems in both sub-tasks.
arXiv Detail & Related papers (2021-05-12T09:27:46Z)
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)