Related papers: Mercury: Ultra-Fast Language Models Based on Diffusion

URL: http://arxiv.org/abs/2506.17298v1
Date: Tue, 17 Jun 2025 17:06:18 GMT
Title: Mercury: Ultra-Fast Language Models Based on Diffusion
Authors: Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, Volodymyr Kuleshov
Abstract summary: We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. Mercury Coder comes in two sizes: Mini and Small. Based on independent evaluations, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively.
Score: 58.52391675075641
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at https://platform.inceptionlabs.ai/ and free playground at https://chat.inceptionlabs.ai

Related papers

Moirai 2.0: When Less Is More for Time Series Forecasting [91.36760228926214]
Moirai 2.0 is a decoder-only foundation model trained on a new corpus of 36M series. It ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large.
arXiv Detail & Related papers (2025-11-12T12:15:35Z)
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference [58.06027151683975]
We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding.
arXiv Detail & Related papers (2025-08-04T08:43:01Z)
Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation [0.0]
Tiny QA Benchmark++ (TQB++) is designed to give large-language-model (LLM) pipelines a unit-test style safety net dataset that runs in seconds with minimal cost. TQB++ couples a 52-item English gold set with a tiny synthetic-data generator PyPI package built on provider-agnostic LiteLLM. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools.
arXiv Detail & Related papers (2025-05-17T15:40:03Z)
Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding [55.2480439325792]
In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
arXiv Detail & Related papers (2025-04-29T06:33:13Z)
Jasper and Stella: distillation of SOTA embedding models [8.708650717134008]
We propose a novel multi-stage distillation framework that enables a smaller student embedding model to distill multiple teacher embedding models. We utilize Matryoshka Representation Learning (MRL) to reduce the vector dimensionality of the student embedding model effectively. Our student model named Jasper with 2 billion parameters, built upon the Stella embedding model, obtained the No. 3 position on the Massive Text Embedding Benchmark leaderboard.
arXiv Detail & Related papers (2024-12-26T04:05:28Z)
AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration [0.3626013617212667]
We introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that accelerates generation by decoupling the draft and verify phases. Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices. We evaluate our approach over multiple datasets and show that AMUSD achieves an average 29% improvement over speculative decoding and up to 1.96$\times$ speedup over conventional autoregressive decoding.
arXiv Detail & Related papers (2024-10-22T19:15:35Z)
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code - the first real-world benchmark for this task. We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics. Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
arXiv Detail & Related papers (2024-03-05T17:56:27Z)
StarCoder 2 and The Stack v2: The Next Generation [105.93298676368798]
We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens. We thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size.
arXiv Detail & Related papers (2024-02-29T13:53:35Z)
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, and a multimodal vision model that is pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
SantaCoder: don't reach for the stars! [27.050410834027705]
The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. Our best model outperforms previous open-source multilingual code generation models in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E.
arXiv Detail & Related papers (2023-01-09T10:52:35Z)
LegoNN: Building Modular Encoder-Decoder Models [117.47858131603112]
State-of-the-art encoder-decoder models are constructed and trained end-to-end as an atomic unit. No component of the model can be (re-)used without the others, making it impossible to share parts. We describe LegoNN, a procedure for building encoder-decoder architectures in a way so that its parts can be applied to other tasks without the need for fine-tuning.
arXiv Detail & Related papers (2022-06-07T14:08:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.