Mixtral of Experts
- URL: http://arxiv.org/abs/2401.04088v1
- Date: Mon, 8 Jan 2024 18:47:34 GMT
- Title: Mixtral of Experts
- Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch,
Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas,
Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume
Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre
Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao,
Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William
El Sayed
- Abstract summary: Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model.
Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.
We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B chat model on human benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.
Mixtral has the same architecture as Mistral 7B, with the difference that each
layer is composed of 8 feedforward blocks (i.e. experts). For every token, at
each layer, a router network selects two experts to process the current state
and combine their outputs. Even though each token only sees two experts, the
selected experts can be different at each timestep. As a result, each token has
access to 47B parameters, but only uses 13B active parameters during inference.
Mixtral was trained with a context size of 32k tokens and it outperforms or
matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,
Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and
multilingual benchmarks. We also provide a model fine-tuned to follow
instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo,
Claude-2.1, Gemini Pro, and the Llama 2 70B chat model on human benchmarks. Both
the base and instruct models are released under the Apache 2.0 license.
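To make the routing described above concrete, here is a minimal PyTorch sketch of a sparse MoE feed-forward layer in the spirit of the abstract: a linear router scores 8 experts per token, the top-2 are evaluated, and their outputs are combined with softmax-normalized gate weights. This is an illustration only, with toy dimensions and plain MLP experts rather than the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse MoE feed-forward layer: 8 experts, top-2 routing per token."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an independent feed-forward block (a plain MLP here;
        # the real model uses much larger, gated feed-forward blocks).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        logits = self.router(x)                  # (n_tokens, n_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the two selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e         # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

layer = SparseMoELayer()
tokens = torch.randn(16, 512)                    # 16 token states
print(layer(tokens).shape)                       # torch.Size([16, 512])
```

Because only two of the eight expert blocks run for each token, the per-token active parameter count is a fraction of the total, which is how the model reaches 47B total but only 13B active parameters as stated in the abstract.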
Related papers
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) experts and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
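As a rough illustration of how zero-computation experts can sit alongside ordinary FFN experts in a single routed layer, the hypothetical sketch below adds an identity (copy) expert and a zero expert to the expert pool. The specific expert types, sizes, and top-1 routing here are assumptions for illustration, not the MoE++ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroExpert(nn.Module):
    def forward(self, x):              # contributes nothing: the token skips the FFN
        return torch.zeros_like(x)

class CopyExpert(nn.Module):
    def forward(self, x):              # identity: passes the token through unchanged
        return x

class HeterogeneousMoE(nn.Module):
    """Hypothetical mix of FFN experts and zero-computation experts,
    routed per token with a plain top-1 linear router."""

    def __init__(self, d_model=512, d_ff=2048, n_ffn_experts=4):
        super().__init__()
        ffn_experts = [
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_ffn_experts)
        ]
        self.experts = nn.ModuleList(ffn_experts + [ZeroExpert(), CopyExpert()])
        self.router = nn.Linear(d_model, len(self.experts), bias=False)

    def forward(self, x):                              # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, choice = probs.max(dim=-1)               # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = choice == e
            if mask.any():
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

x = torch.randn(8, 512)
print(HeterogeneousMoE()(x).shape)                     # torch.Size([8, 512])
```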
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts [95.26323548734692]
MoMa is a modality-aware mixture-of-experts architecture for pre-training mixed-modal, early-fusion language models.
Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings.
arXiv Detail & Related papers (2024-07-31T17:46:51Z)
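The modality-aware idea in the MoMa entry above can be sketched as expert groups with per-group routers: text tokens are routed only among text experts and image tokens only among image experts. The group sizes, top-1 routing, and helper names below are illustrative assumptions, not MoMa's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model=512, d_ff=2048):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

class ModalityAwareMoE(nn.Module):
    """Sketch of modality-aware expert groups with a separate router per group."""

    def __init__(self, d_model=512, n_text=4, n_image=4):
        super().__init__()
        self.groups = nn.ModuleDict({
            "text":  nn.ModuleList(make_ffn(d_model) for _ in range(n_text)),
            "image": nn.ModuleList(make_ffn(d_model) for _ in range(n_image)),
        })
        self.routers = nn.ModuleDict({
            "text":  nn.Linear(d_model, n_text, bias=False),
            "image": nn.Linear(d_model, n_image, bias=False),
        })

    def forward(self, x, modality):              # modality: "text"/"image" tag per token
        out = torch.zeros_like(x)
        for name in ("text", "image"):
            idx = torch.tensor([i for i, m in enumerate(modality) if m == name])
            if idx.numel() == 0:
                continue
            tokens = x[idx]
            probs = F.softmax(self.routers[name](tokens), dim=-1)
            gate, choice = probs.max(dim=-1)     # top-1 within the modality group
            group_out = torch.zeros_like(tokens)
            for e, expert in enumerate(self.groups[name]):
                mask = choice == e
                if mask.any():
                    group_out[mask] = gate[mask, None] * expert(tokens[mask])
            out[idx] = group_out
        return out

x = torch.randn(6, 512)
mods = ["text", "image", "text", "text", "image", "image"]
print(ModalityAwareMoE()(x, mods).shape)         # torch.Size([6, 512])
```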
- Yuan 2.0-M32: Mixture of Experts with Attention Router [30.8849836244273]
Yuan 2.0-M32 demonstrates competitive capability on coding, math, and various domains of expertise, with only 3.7B active parameters out of 40B in total.
Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracies of 55.89 and 95.8, respectively.
arXiv Detail & Related papers (2024-05-28T09:05:08Z)
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [81.18305296110853]
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains.
Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion.
BTX generalizes two special cases: the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
arXiv Detail & Related papers (2024-03-12T16:54:58Z)
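The BTX recipe above can be sketched at the level of a single feed-forward layer: independently trained copies of a seed model's FFN become the experts of a sparse MoE layer, and only a newly initialized router then needs to be learned during finetuning. The sizes, helper names, and top-2 routing below are assumptions for illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model=512, d_ff=2048):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

# 1) Branch: copy one seed FFN several times; each copy would be trained
#    on a different domain (math, code, ...) in an embarrassingly parallel way.
seed_ffn = make_ffn()
branched = [copy.deepcopy(seed_ffn) for _ in range(4)]

# 2) MiX: reuse the branched FFNs as the experts of a sparse MoE layer and
#    learn the routing with a freshly initialized router.
class MixedLayer(nn.Module):
    def __init__(self, experts, d_model=512, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.router = nn.Linear(d_model, len(experts), bias=False)  # newly initialized
        self.top_k = top_k

    def forward(self, x):                                # x: (n_tokens, d_model)
        weights, chosen = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

layer = MixedLayer(branched)
print(layer(torch.randn(8, 512)).shape)                  # torch.Size([8, 512])
```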
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models [26.447210565680116]
We propose the DeepSeekMoE architecture towards ultimate expert specialization.
It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ of them, allowing for a more flexible combination of activated experts; and (2) isolating a few experts as shared ones that are always activated, mitigating redundancy among the routed experts.
We show that DeepSeekMoE achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation.
arXiv Detail & Related papers (2024-01-11T17:31:42Z)
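To make the $mN$ / $mK$ segmentation concrete, here is a hypothetical sketch: the routed experts are made m times smaller and m times more numerous, the router activates m*K of them, and a small set of shared experts is always applied. The sizes, activation function, and shared-expert handling below are assumptions, not DeepSeekMoE's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def small_ffn(d_model, d_ff):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

class FineGrainedMoE(nn.Module):
    """Sketch of fine-grained segmentation: N experts activating K are replaced
    by m*N smaller experts activating m*K, plus always-on shared experts."""

    def __init__(self, d_model=512, d_ff=2048, n=8, k=2, m=4, n_shared=1):
        super().__init__()
        self.top_k = m * k                                   # more, but smaller, active experts
        self.routed = nn.ModuleList(
            small_ffn(d_model, d_ff // m) for _ in range(m * n)   # each expert is 1/m the size
        )
        self.shared = nn.ModuleList(small_ffn(d_model, d_ff // m) for _ in range(n_shared))
        self.router = nn.Linear(d_model, m * n, bias=False)

    def forward(self, x):                                    # x: (n_tokens, d_model)
        weights, chosen = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        shared_out = sum(expert(x) for expert in self.shared)    # shared experts: always active
        routed_out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = chosen[:, slot] == e
                if mask.any():
                    routed_out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
        return shared_out + routed_out

print(FineGrainedMoE()(torch.randn(4, 512)).shape)            # torch.Size([4, 512])
```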
- Mistral 7B [62.17530433867458]
Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.
We also provide a model fine-tuned to follow instructions, Mistral 7B - Instruct, that surpasses the Llama 2 13B chat model on both human and automated benchmarks.
arXiv Detail & Related papers (2023-10-10T17:54:58Z)
- Mixture-of-Experts with Expert Choice Routing [44.777850078713634]
Prior work allocates a fixed number of experts to each token using a top-k function.
We propose a heterogeneous mixture-of-experts employing an expert choice method.
Our method improves training convergence time by more than 2x.
arXiv Detail & Related papers (2022-02-18T17:46:11Z)
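The contrast drawn in the entry above (tokens choosing a fixed top-k of experts vs. experts choosing tokens) can be sketched as follows: the router produces a token-expert affinity matrix, and each expert selects its own top-c tokens, so the amount of compute a token receives can vary. The capacity handling and normalization here are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertChoiceMoE(nn.Module):
    """Sketch of expert-choice routing: each expert picks its own top-c tokens
    from the token-expert affinity matrix."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=4, capacity_factor=2):
        super().__init__()
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (n_tokens, d_model)
        n_tokens = x.shape[0]
        # capacity: how many tokens each expert processes
        c = max(1, self.capacity_factor * n_tokens // len(self.experts))
        affinity = F.softmax(self.router(x), dim=-1)         # (n_tokens, n_experts)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            gate, picked = affinity[:, e].topk(min(c, n_tokens))   # expert e picks its tokens
            out[picked] += gate[:, None] * expert(x[picked])
        return out

print(ExpertChoiceMoE()(torch.randn(16, 512)).shape)          # torch.Size([16, 512])
```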
- Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference [17.97893143555333]
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation.
In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation.
Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.
arXiv Detail & Related papers (2021-09-24T20:42:16Z)
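A minimal sketch of task-level routing, as contrasted with per-token routing in the entry above: all tokens belonging to the same task share one routing decision derived from a task identity, so at deployment only that task's expert needs to be kept. The embedding-plus-argmax router below is an assumption for illustration; the paper's routing and training are more involved.

```python
import torch
import torch.nn as nn

class TaskMoE(nn.Module):
    """Sketch of task-level routing: every token of a task (e.g. a target
    language) goes to the same expert, so a per-task sub-network can be
    extracted for deployment."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=4, n_tasks=8):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, d_model)
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def expert_for_task(self, task_id):
        logits = self.router(self.task_embed(torch.tensor(task_id)))
        return int(logits.argmax())                  # one expert per task

    def forward(self, x, task_id):                   # all tokens share one task id
        e = self.expert_for_task(task_id)
        return self.experts[e](x)                    # only this expert's weights are needed

model = TaskMoE()
print(model(torch.randn(8, 512), task_id=3).shape)   # torch.Size([8, 512])
print("task 3 uses expert", model.expert_for_task(3))
```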
- A Mixture of $h-1$ Heads is Better than $h$ Heads [63.12336930345417]
We propose the mixture of attentive experts model (MAE).
Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks.
Our analysis shows that our model learns to specialize different experts to different inputs.
arXiv Detail & Related papers (2020-05-13T19:05:58Z)
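The MAE entry above is summarized very briefly; as a rough, hypothetical sketch of the underlying "use h-1 of h attention heads per expert" idea, the code below treats each choice of dropped head as an expert and mixes the experts with an input-dependent gate. The gating, rescaling, and soft mixing here are simplifying assumptions and differ from the paper's model and training procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfAttentiveExperts(nn.Module):
    """Rough sketch: expert i is multi-head attention with head i left out,
    and an input-dependent gate mixes the h experts."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, n_heads)      # one expert per dropped head

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(b, s, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                             # (b, h, s, d_head) per-head outputs
        g = F.softmax(self.gate(x.mean(dim=1)), dim=-1)   # gate over experts, (b, h)
        out = torch.zeros_like(x)
        for i in range(self.h):
            keep = torch.ones(self.h)
            keep[i] = 0.0                            # expert i drops head i
            scaled = heads * keep.view(1, self.h, 1, 1) * (self.h / (self.h - 1))
            expert_out = scaled.transpose(1, 2).reshape(b, s, -1)
            out = out + g[:, i].view(b, 1, 1) * expert_out
        return self.out(out)

x = torch.randn(2, 10, 512)
print(MixtureOfAttentiveExperts()(x).shape)          # torch.Size([2, 10, 512])
```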
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.