Related papers: Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale

URL: http://arxiv.org/abs/2503.01868v1
Date: Tue, 25 Feb 2025 19:47:20 GMT
Title: Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale
Authors: Jerome Ku, Eric Nguyen, David W. Romero, Garyk Brixi, Brandon Yang, Anton Vorontsov, Ali Taghibakhshi, Amy X. Lu, Dave P. Burke, Greg Brockman, Stefano Massaroli, Christopher Ré, Patrick D. Hsu, Brian L. Hie, Stefano Ermon, Michael Poli
Abstract summary: We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids.
Score: 68.6602625868888
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression, with input-dependent convolutions and attention offering complementary performance. Second, co-designing convolution operators and hardware-aware algorithms enables efficiency gains in regimes where previous alternative architectures struggle to surpass Transformers. At the 40 billion parameter scale, we train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids. On H100 GPUs and model width 4096, individual operators in the proposed multi-hybrid StripedHyena 2 architecture achieve two-fold throughput improvement over linear attention and state-space models. Multi-hybrids excel at sequence modeling over byte-tokenized data, as demonstrated by the Evo 2 line of models. We discuss the foundations that enable these results, including architecture design, overlap-add blocked kernels for tensor cores, and dedicated all-to-all and point-to-point context parallelism strategies.

Related papers

OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models [36.0400717590138]
We present OmniMamba, the first linear-architecture-based multimodal generation model. It generates both text and images through a unified next-token prediction paradigm. It achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks.
arXiv Detail & Related papers (2025-03-11T17:59:46Z)
JAX-Fluids 2.0: Towards HPC for Differentiable CFD of Compressible Two-phase Flows [0.0]
JAX-Fluids is a Python-based fully-differentiable CFD solver designed for compressible single- and two-phase flows. We introduce a parallelization strategy utilizing JAX primitive operations that scales efficiently on GPU (up to 512 NVIDIA A100 graphics cards) and TPU (up to 1024 TPU v3 cores) HPC systems. The new code version offers enhanced two-phase flow modeling capabilities.
arXiv Detail & Related papers (2024-02-07T19:05:27Z)
Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions [101.08706223326928]
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In this paper, we seek to enable $\mathcal{O}(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
arXiv Detail & Related papers (2023-10-28T18:40:03Z)
Boosting the effective performance of massively parallel tensor network state algorithms on hybrid CPU-GPU based architectures via non-Abelian symmetries [0.0]
Non-Abelian symmetry related tensor algebra based on the Wigner-Eckart theorem is fully detached from the conventional tensor network layer. We have achieved an order of magnitude increase in performance with respect to results reported in arXiv:2305.05581 in terms of computational complexity. Our solution has an estimated effective performance of 250-500 TFLOPS.
arXiv Detail & Related papers (2023-09-23T07:49:53Z)
A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs [1.7481226034111275]
This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. It achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.
arXiv Detail & Related papers (2023-05-22T22:41:49Z)
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation. It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z)
Hyena Hierarchy: Towards Larger Convolutional Language Models [115.82857881546089]
Hyena is a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods.
arXiv Detail & Related papers (2023-02-21T18:29:25Z)
On Optimizing the Communication of Model Parallelism [74.15423270435949]
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL). In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh. We propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule.
arXiv Detail & Related papers (2022-11-10T03:56:48Z)
X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operations. The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.