Mistral 7B
- URL: http://arxiv.org/abs/2310.06825v1
- Date: Tue, 10 Oct 2023 17:54:58 GMT
- Title: Mistral 7B
- Authors: Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,
Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,
William El Sayed
- Abstract summary: Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.
We also provide a model fine-tuned to follow instructions, Mistral 7B-Instruct, that surpasses the Llama 2 13B-Chat model both on human and automated benchmarks.
- Score: 62.17530433867458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered
for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B
across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and
code generation. Our model leverages grouped-query attention (GQA) for faster
inference, coupled with sliding window attention (SWA) to effectively handle
sequences of arbitrary length with a reduced inference cost. We also provide a
model fine-tuned to follow instructions, Mistral 7B-Instruct, that surpasses
the Llama 2 13B-Chat model both on human and automated benchmarks. Our
models are released under the Apache 2.0 license.
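The two architectural choices named in the abstract can be made concrete with a short sketch. The following is a minimal, self-contained PyTorch illustration of grouped-query attention combined with a sliding-window causal mask; it is not the released Mistral implementation, the function names and toy shapes are mine, and the commonly reported Mistral 7B settings (32 query heads, 8 key/value heads, a 4096-token window) are mentioned only as assumptions in the comments.

```python
# Minimal sketch of grouped-query attention (GQA) with a sliding window
# attention (SWA) mask. Illustrative only; not the released Mistral code.
import math
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j iff i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

def gqa_with_swa(q, k, v, window: int) -> torch.Tensor:
    """
    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), with n_kv_heads dividing n_q_heads.
    Each KV head is shared by a group of query heads, shrinking the KV cache.
    """
    b, n_q, s, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv
    # Broadcast each KV head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    mask = sliding_window_causal_mask(s, window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    # Toy shapes; Mistral 7B is commonly described as using 32 query heads,
    # 8 KV heads and a 4096-token window (treated here as assumptions).
    b, n_q, n_kv, s, d, window = 1, 8, 2, 16, 32, 4
    q = torch.randn(b, n_q, s, d)
    k = torch.randn(b, n_kv, s, d)
    v = torch.randn(b, n_kv, s, d)
    print(gqa_with_swa(q, k, v, window).shape)  # torch.Size([1, 8, 16, 32])
```

Because every position attends only to the last `window` tokens, the key/value cache can be kept as a rolling buffer of that fixed size, which is where the reduced inference cost mentioned in the abstract comes from; GQA shrinks the same cache further by storing keys and values for fewer heads (8 rather than 32 in the reported configuration).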
Related papers
- MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router [55.88046193872355]
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts.
We propose MoE-Pruner, a method that prunes the weights whose magnitude, multiplied by the corresponding input activation and router weight, is smallest.
Our pruning method is one-shot, requiring no retraining or weight updates; a sketch of this scoring rule follows after this entry.
arXiv Detail & Related papers (2024-10-15T19:22:27Z)
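The scoring rule in the MoE-Pruner entry above is simple enough to sketch. The snippet below is a hedged reconstruction from that one-sentence description, not the paper's code: it scores each weight by |weight| x input-activation norm x router weight and zeroes out the lowest-scoring fraction in one shot. The function names, tensor shapes, and the per-channel activation norm are assumptions.

```python
# Hedged sketch of a one-shot pruning score in the spirit of the MoE-Pruner
# summary above: |weight| * input activation norm * router weight.
# Reconstructed from the one-sentence description; names/shapes are assumptions.
import torch

def moe_pruner_scores(weight, act_norm, router_weight):
    """
    weight:        (out_features, in_features) expert weight matrix
    act_norm:      (in_features,) per-channel activation norm from calibration data
    router_weight: scalar gate weight the router assigns to this expert
    Returns an importance score per weight; the smallest scores are pruned.
    """
    return weight.abs() * act_norm.unsqueeze(0) * router_weight

def one_shot_prune(weight, scores, sparsity: float):
    """Zero out the lowest-scoring fraction of weights, with no retraining."""
    k = int(sparsity * weight.numel())
    threshold = scores.flatten().kthvalue(k).values
    mask = scores > threshold
    return weight * mask

if __name__ == "__main__":
    w = torch.randn(16, 32)
    act = torch.rand(32)   # stand-in for calibration activation norms (assumed)
    gate = 0.7             # assumed scalar router weight for this expert
    pruned = one_shot_prune(w, moe_pruner_scores(w, act, gate), sparsity=0.5)
    print((pruned == 0).float().mean())  # ~0.5
```

Because the mask is computed from statistics gathered on calibration data, no gradient step or weight update is needed, which matches the one-shot claim in the summary.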
- Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z)
- GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z)
- LLM Pruning and Distillation in Practice: The Minitron Approach [57.57486238643575]
We present a report on compressing Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters.
We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning.
This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B model from Mistral NeMo 12B.
arXiv Detail & Related papers (2024-08-21T17:38:48Z)
- A Teacher Is Worth A Million Instructions [4.322454918650575]
Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters.
arXiv Detail & Related papers (2024-06-27T11:48:25Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- ORPO: Monolithic Preference Optimization without Reference Model [9.53888551630878]
We study the crucial role of supervised fine-tuning within the context of preference alignment.
We introduce a model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase.
Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback dataset surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters; a sketch of the objective follows after this entry.
arXiv Detail & Related papers (2024-03-12T14:34:08Z)
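The ORPO entry above names an odds-ratio preference objective without spelling it out. The sketch below is an assumed formulation, not necessarily the paper's exact loss: it defines odds(y|x) = p(y|x) / (1 - p(y|x)) from length-normalized sequence probabilities and adds a log-sigmoid odds-ratio term to the ordinary supervised loss, so fine-tuning and preference alignment happen in one phase. The function names, the lambda weight, and the toy numbers are mine.

```python
# Hedged sketch of an odds-ratio preference loss in the spirit of the ORPO
# entry above. Assumed formulation, not necessarily the paper's exact objective.
import torch
import torch.nn.functional as F

def odds_ratio_loss(logp_chosen, logp_rejected, nll_chosen, lam: float = 0.1):
    """
    logp_chosen / logp_rejected: length-normalized sequence log-probabilities
    (batch,) under the policy; nll_chosen: standard SFT loss on the chosen answer.
    """
    def log_odds(logp):
        # log(p / (1 - p)) computed stably from log p.
        return logp - torch.log1p(-torch.exp(logp))

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    preference_term = -F.logsigmoid(ratio).mean()
    # Single combined loss: supervised term plus preference term.
    return nll_chosen + lam * preference_term

if __name__ == "__main__":
    # Toy values standing in for model outputs (assumptions, not real data).
    logp_c = torch.tensor([-0.9, -1.2])   # chosen completions
    logp_r = torch.tensor([-1.5, -1.1])   # rejected completions
    nll = torch.tensor(1.0)               # SFT loss on the chosen answers
    print(odds_ratio_loss(logp_c, logp_r, nll))
```

Folding the preference term into the supervised loss is what makes such an approach single-phase, which matches the summary's claim that no additional preference alignment stage or reference model is needed.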
- Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding [0.0]
We present significant advancements in the pretraining of Mistral 7B, a large-scale language model.
We release models with context lengths of 4096 and 32768 tokens, and further refine performance with a specialized 16384 context length instruction-tuned model.
We present compelling results indicating Malaysian Mistral's superior performance on the Tatabahasa (Malay grammar) test set.
arXiv Detail & Related papers (2024-01-24T16:21:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.