Mistral 7B
- URL: http://arxiv.org/abs/2310.06825v1
- Date: Tue, 10 Oct 2023 17:54:58 GMT
- Title: Mistral 7B
- Authors: Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,
Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,
Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,
William El Sayed
- Abstract summary: Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation.
We also provide a model fine-tuned to follow instructions, Mistral 7B-Instruct, that surpasses the Llama 2 13B-Chat model both on human and automated benchmarks.
- Score: 62.17530433867458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered
for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B
across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and
code generation. Our model leverages grouped-query attention (GQA) for faster
inference, coupled with sliding window attention (SWA) to effectively handle
sequences of arbitrary length with a reduced inference cost. We also provide a
model fine-tuned to follow instructions, Mistral 7B-Instruct, that surpasses
the Llama 2 13B-Chat model both on human and automated benchmarks. Our
models are released under the Apache 2.0 license.
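The two architectural choices named in the abstract can be made concrete with a short sketch. The following is a minimal, self-contained PyTorch illustration of grouped-query attention combined with a sliding-window causal mask; it is not the released Mistral implementation, the function names and toy shapes are mine, and the commonly reported Mistral 7B settings (32 query heads, 8 key/value heads, a 4096-token window) are mentioned only as assumptions in the comments.

```python
# Minimal sketch of grouped-query attention (GQA) with a sliding window
# attention (SWA) mask. Illustrative only; not the released Mistral code.
import math
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j iff i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

def gqa_with_swa(q, k, v, window: int) -> torch.Tensor:
    """
    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), with n_kv_heads dividing n_q_heads.
    Each KV head is shared by a group of query heads, shrinking the KV cache.
    """
    b, n_q, s, d = q.shape
    n_kv = k.shape[1]
    group = n_q // n_kv
    # Broadcast each KV head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    mask = sliding_window_causal_mask(s, window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    # Toy shapes; Mistral 7B is commonly described as using 32 query heads,
    # 8 KV heads and a 4096-token window (treated here as assumptions).
    b, n_q, n_kv, s, d, window = 1, 8, 2, 16, 32, 4
    q = torch.randn(b, n_q, s, d)
    k = torch.randn(b, n_kv, s, d)
    v = torch.randn(b, n_kv, s, d)
    print(gqa_with_swa(q, k, v, window).shape)  # torch.Size([1, 8, 16, 32])
```

Because every position attends only to the last `window` tokens, the key/value cache can be kept as a rolling buffer of that fixed size, which is where the reduced inference cost mentioned in the abstract comes from; GQA shrinks the same cache further by storing keys and values for fewer heads (8 rather than 32 in the reported configuration).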
Related papers
- MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router [55.88046193872355]
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts.
We propose MoE-Pruner, a method that prunes the weights whose magnitude, multiplied by the corresponding input activation and router weight, is smallest.
Our pruning method is one-shot, requiring no retraining or weight updates; a sketch of this scoring rule follows after this entry.
arXiv Detail & Related papers (2024-10-15T19:22:27Z)
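The scoring rule in the MoE-Pruner entry above is simple enough to sketch. The snippet below is a hedged reconstruction from that one-sentence description, not the paper's code: it scores each weight by |weight| x input-activation norm x router weight and zeroes out the lowest-scoring fraction in one shot. The function names, tensor shapes, and the per-channel activation norm are assumptions.

```python
# Hedged sketch of a one-shot pruning score in the spirit of the MoE-Pruner
# summary above: |weight| * input activation norm * router weight.
# Reconstructed from the one-sentence description; names/shapes are assumptions.
import torch

def moe_pruner_scores(weight, act_norm, router_weight):
    """
    weight:        (out_features, in_features) expert weight matrix
    act_norm:      (in_features,) per-channel activation norm from calibration data
    router_weight: scalar gate weight the router assigns to this expert
    Returns an importance score per weight; the smallest scores are pruned.
    """
    return weight.abs() * act_norm.unsqueeze(0) * router_weight

def one_shot_prune(weight, scores, sparsity: float):
    """Zero out the lowest-scoring fraction of weights, with no retraining."""
    k = int(sparsity * weight.numel())
    threshold = scores.flatten().kthvalue(k).values
    mask = scores > threshold
    return weight * mask

if __name__ == "__main__":
    w = torch.randn(16, 32)
    act = torch.rand(32)   # stand-in for calibration activation norms (assumed)
    gate = 0.7             # assumed scalar router weight for this expert
    pruned = one_shot_prune(w, moe_pruner_scores(w, act, gate), sparsity=0.5)
    print((pruned == 0).float().mean())  # ~0.5
```

Because the mask is computed from statistics gathered on calibration data, no gradient step or weight update is needed, which matches the one-shot claim in the summary.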
- Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z)
- GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z)
- LLM Pruning and Distillation in Practice: The Minitron Approach [57.57486238643575]
We present a report on compressing Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters.
We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning.
This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B model from Mistral NeMo 12B.
arXiv Detail & Related papers (2024-08-21T17:38:48Z)
- A Teacher Is Worth A Million Instructions [4.322454918650575]
Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters.
arXiv Detail & Related papers (2024-06-27T11:48:25Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- ORPO: Monolithic Preference Optimization without Reference Model [9.53888551630878]
We study the crucial role of supervised fine-tuning within the context of preference alignment.
We introduce a model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase.
Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback dataset surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters; a sketch of the objective follows after this entry.
arXiv Detail & Related papers (2024-03-12T14:34:08Z)
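The ORPO entry above names an odds-ratio preference objective without spelling it out. The sketch below is an assumed formulation, not necessarily the paper's exact loss: it defines odds(y|x) = p(y|x) / (1 - p(y|x)) from length-normalized sequence probabilities and adds a log-sigmoid odds-ratio term to the ordinary supervised loss, so fine-tuning and preference alignment happen in one phase. The function names, the lambda weight, and the toy numbers are mine.

```python
# Hedged sketch of an odds-ratio preference loss in the spirit of the ORPO
# entry above. Assumed formulation, not necessarily the paper's exact objective.
import torch
import torch.nn.functional as F

def odds_ratio_loss(logp_chosen, logp_rejected, nll_chosen, lam: float = 0.1):
    """
    logp_chosen / logp_rejected: length-normalized sequence log-probabilities
    (batch,) under the policy; nll_chosen: standard SFT loss on the chosen answer.
    """
    def log_odds(logp):
        # log(p / (1 - p)) computed stably from log p.
        return logp - torch.log1p(-torch.exp(logp))

    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    preference_term = -F.logsigmoid(ratio).mean()
    # Single combined loss: supervised term plus preference term.
    return nll_chosen + lam * preference_term

if __name__ == "__main__":
    # Toy values standing in for model outputs (assumptions, not real data).
    logp_c = torch.tensor([-0.9, -1.2])   # chosen completions
    logp_r = torch.tensor([-1.5, -1.1])   # rejected completions
    nll = torch.tensor(1.0)               # SFT loss on the chosen answers
    print(odds_ratio_loss(logp_c, logp_r, nll))
```

Folding the preference term into the supervised loss is what makes such an approach single-phase, which matches the summary's claim that no additional preference alignment stage or reference model is needed.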
- Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding [0.0]
We present significant advancements in the pretraining of Mistral 7B, a large-scale language model.
We release models with context lengths of 4096 and 32768 tokens, and further refine performance with a specialized 16384 context length instruction-tuned model.
We present compelling results indicating Malaysian Mistral's superior performance on the Tatabahasa (Malay grammar) test set.
arXiv Detail & Related papers (2024-01-24T16:21:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.