Related papers: Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

URL: http://arxiv.org/abs/2602.21196v1
Date: Tue, 24 Feb 2026 18:54:39 GMT
Title: Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking
Authors: Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin,
Abstract summary: We present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level.<n>Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5$%$ for 32B Transformers.<n>UPipe can support the context length of 5M tokens when training Llama3-8B on a single 8$times$H100 node.
Score: 8.535396390209103
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5$\%$ for 32B Transformers, while matching previous context parallelism techniques in terms of training speed. UPipe can support the context length of 5M tokens when training Llama3-8B on a single 8$\times$H100 node, improving upon prior methods by over 25$\%$.

Related papers

Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts [68.79341332280062]
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time.<n>We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier.<n>Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint.
arXiv Detail & Related papers (2026-02-02T13:52:40Z)
From Projection to Prediction: Beyond Logits for Scalable Language Models [0.28647133890966986]
Training Large Language Models (LLMs) typically involves a two-stage pipeline at the output layer.<n>By directly computing the loss from hidden states and target tokens, our approach bypasses explicit logits materialization.
arXiv Detail & Related papers (2025-11-18T02:23:47Z)
TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers.<n>We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process.<n>TNT achieves a substantial acceleration in training speed-up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z)
Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts [5.585952216289788]
Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity.<n>Recurrent Memory (RMTs) offer a solution by reducing the cost linear time and constant memory usage.<n>But their memory update mechanism leads to sequential execution causing a performance bottleneck.<n>We introduce Diagonal, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence.
arXiv Detail & Related papers (2025-06-05T16:43:48Z)
ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs)<n>In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck.<n>We achieve a 1.76x improvement in chunk throughput, thereby achieving a 23.50x acceleration in the prefill stage with negligible performance loss.
arXiv Detail & Related papers (2025-02-20T07:10:43Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models. HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training. DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention. Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.