Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
- URL: http://arxiv.org/abs/2507.22448v1
- Date: Wed, 30 Jul 2025 07:55:33 GMT
- Title: Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
- Authors: Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, Mugariya Farooq, Giulia Campesan, Ruxandra Cojocaru, Yasser Djilali, Shi Hu, Iheb Chaabane, Puneesh Khanna, Mohamed El Amine Seddik, Ngoc Dung Huynh, Phuc Le Khac, Leen AlQadi, Billel Mokeddem, Mohamed Chami, Abdalgader Abubaker, Mikhail Lubinets, Kacper Piskorski, Slim Frikha,
- Abstract summary: Falcon-H1 is a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency.<n>Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters.<n>With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications.
- Score: 7.261605702995345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this report, we introduce Falcon-H1, a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency across diverse use cases. Unlike earlier Falcon models built solely on Transformer or Mamba architectures, Falcon-H1 adopts a parallel hybrid approach that combines Transformer-based attention with State Space Models (SSMs), known for superior long-context memory and computational efficiency. We systematically revisited model design, data strategy, and training dynamics, challenging conventional practices in the field. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. Quantized instruction-tuned models are also available, totaling over 30 checkpoints on Hugging Face Hub. Falcon-H1 models demonstrate state-of-the-art performance and exceptional parameter and training efficiency. The flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B, while using fewer parameters and less data. Smaller models show similar trends: the Falcon-H1-1.5B-Deep rivals current leading 7B-10B models, and Falcon-H1-0.5B performs comparably to typical 7B models from 2024. These models excel across reasoning, mathematics, multilingual tasks, instruction following, and scientific knowledge. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications. All models are released under a permissive open-source license, underscoring our commitment to accessible and impactful AI research.
Related papers
- Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models [164.47008715747822]
Nemotron-H is a family of 8B and 56B/47B hybrid Mamba-Transformer models.<n>We replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers.<n>Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized open-sourced Transformer models.
arXiv Detail & Related papers (2025-04-04T17:41:58Z) - DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models [50.32663816994459]
Diffusion-styled Preference Optimization (model) provides an efficient and policy-agnostic solution for aligning LLMs with humans.<n>modelavoids the time latency associated with token-level generation.<n>Experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that modelachieves superior alignment performance across various settings.
arXiv Detail & Related papers (2025-03-06T09:21:54Z) - Hymba: A Hybrid-head Architecture for Small Language Models [65.94140459055244]
Hymba is a family of small language models featuring a hybrid-head parallel architecture.
We introduce learnable meta tokens that are prepended to prompts, storing critical information.
This model is further optimized by incorporating cross-layer key-value sharing and partial sliding window attention.
arXiv Detail & Related papers (2024-11-20T19:51:25Z) - Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining [49.730897226510095]
We introduce JOWA: Jointly-Reinforced World-Action model, an offline model-based RL agent pretrained on Atari games with 6 billion tokens data.<n>Our largest agent, with 150 million parameters, 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on averange.
arXiv Detail & Related papers (2024-10-01T10:25:03Z) - The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources.<n>The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks.<n>We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z) - Falcon2-11B Technical Report [12.473984346805011]
We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and Falcon2-11B-vlm, which is a vision-to-text model.
We report our findings during the training of the Falcon2-11B which follows a multi-stage approach.
We also report the effect of doubling the batch size mid-training and how training loss spikes are affected by the learning rate.
arXiv Detail & Related papers (2024-07-20T14:23:15Z) - BitDelta: Your Fine-Tune May Only Be Worth One Bit [57.558376557639555]
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance.
By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x.
arXiv Detail & Related papers (2024-02-15T18:50:06Z) - The Falcon Series of Open Language Models [36.93493444130304]
We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora.
The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run.
Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1.
arXiv Detail & Related papers (2023-11-28T15:12:47Z) - Multiple Run Ensemble Learning withLow-Dimensional Knowledge Graph
Embeddings [4.317340121054659]
We propose a simple but effective performance boosting strategy for knowledge graph embedding (KGE) models.
We repeat the training of a model 6 times in parallel with an embedding size of 200 and then combine the 6 separate models for testing.
We show that our approach enables different models to better cope with their issues on modeling various graph patterns.
arXiv Detail & Related papers (2021-04-11T12:26:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.