ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer
- URL: http://arxiv.org/abs/2511.13198v1
- Date: Mon, 17 Nov 2025 10:08:24 GMT
- Title: ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer
- Authors: Zhixin Ou, Peng Liang, Jianchen Han, Baihui Liu, Linbo Qiao,
- Abstract summary: ParaDySe is a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence.
- Score: 8.224366948749838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing either communication-parallelization cancellation (CPC) on short sequences or out-of-memory (OOM) failures on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly adoption of the optimal strategy according to the current input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by these cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By integrating these techniques, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries. We compare ParaDySe with baselines on representative LLMs on datasets with sequence lengths up to 624K. Experimental results indicate that ParaDySe addresses the OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.
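The abstract does not spell out ParaDySe's cost models or its heuristic, so the following is only a minimal sketch of the general idea under assumed interfaces: a toy sequence-aware cost model plus a greedy layer-wise selector that picks the fastest strategy whose estimated memory still fits the device budget. All names, strategy labels, and cost formulas below are illustrative assumptions, not ParaDySe's actual implementation.

```python
# Hypothetical sketch: layer-wise parallel-strategy selection guided by a
# sequence-aware cost model. Strategy names and cost formulas are assumptions
# for illustration, not ParaDySe's real function libraries or models.
from dataclasses import dataclass

STRATEGIES = ["tensor_parallel", "sequence_parallel", "hybrid"]

@dataclass
class CostModel:
    """Toy cost model: time and memory as simple functions of sequence length."""
    time_coeff: dict   # strategy -> (per-token cost, fixed communication cost)
    mem_coeff: dict    # strategy -> per-token activation memory estimate

    def time(self, strategy: str, seq_len: int) -> float:
        a, b = self.time_coeff[strategy]
        return a * seq_len + b

    def memory(self, strategy: str, seq_len: int) -> float:
        return self.mem_coeff[strategy] * seq_len

def select_layerwise(seq_len: int, n_layers: int, mem_budget: float,
                     cost: CostModel) -> list[str]:
    """Greedy heuristic: for each layer, pick the fastest strategy that keeps
    the running memory estimate under the device budget."""
    plan, mem_used = [], 0.0
    for _ in range(n_layers):
        feasible = [s for s in STRATEGIES
                    if mem_used + cost.memory(s, seq_len) <= mem_budget]
        if not feasible:  # fall back to the most memory-frugal strategy
            feasible = [min(STRATEGIES, key=lambda s: cost.memory(s, seq_len))]
        best = min(feasible, key=lambda s: cost.time(s, seq_len))
        mem_used += cost.memory(best, seq_len)
        plan.append(best)
    return plan

if __name__ == "__main__":
    cost = CostModel(
        time_coeff={"tensor_parallel": (2.0, 5.0),
                    "sequence_parallel": (1.5, 20.0),
                    "hybrid": (1.8, 10.0)},
        mem_coeff={"tensor_parallel": 0.004,
                   "sequence_parallel": 0.002,
                   "hybrid": 0.003},
    )
    print(select_layerwise(seq_len=32768, n_layers=4, mem_budget=1000.0, cost=cost))
```

On short sequences such a selector would favor low-communication strategies (avoiding CPC), while on long sequences it would shift layers toward memory-saving strategies (avoiding OOM); the actual framework additionally hot-switches tensor layouts at runtime through its unified function libraries.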
Related papers
- PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z)
- TIDE: Tuning-Integrated Dynamic Evolution for LLM-Based Automated Heuristic Design [7.264986493460248]
TIDE is a Tuning-Integrated Dynamic Evolution framework designed to decouple structural reasoning from parameter optimization. Experiments across nine optimization problems demonstrate that TIDE significantly outperforms state-of-the-art tuning methods.
arXiv Detail & Related papers (2026-01-29T04:00:02Z) - Experience-Guided Adaptation of Inference-Time Reasoning Strategies [49.954515048847874]
Experience-Guided Reasoner (EGuR) generates tailored strategies at inference time based on accumulated experience. EGuR achieves up to 14% accuracy improvements over the strongest baselines while reducing computational costs by up to 111x.
arXiv Detail & Related papers (2025-11-14T17:45:28Z) - Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work rethinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling [0.0]
Gated Associative Memory (GAM) network is a novel, fully parallel architecture for sequence modeling. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer model and a modern linear-time baseline. Our experiments demonstrate that GAM is consistently faster, outperforming both baselines on training speed, and achieves a superior or competitive final validation perplexity across all datasets.
arXiv Detail & Related papers (2025-08-30T20:59:46Z)
- LOP: Learning Optimal Pruning for Efficient On-Demand MLLMs Scaling [52.1366057696919]
LOP is an efficient neural pruning framework that learns optimal pruning strategies from the target pruning constraint. The LOP approach trains autoregressive neural networks (NNs) to directly predict layer-wise pruning strategies adaptive to the target pruning constraint. Experimental results show that LOP outperforms state-of-the-art pruning methods in various metrics while achieving up to three orders of magnitude speedup.
arXiv Detail & Related papers (2025-06-15T12:14:16Z)
- FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism [33.23902060961886]
Existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. We show that the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution. We propose a heterogeneous-adaptive sequence parallelism method to address this problem.
arXiv Detail & Related papers (2024-12-02T14:16:03Z)
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage (a minimal sketch of this chunking idea appears after the list below).
Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
- Distributed Evolution Strategies with Multi-Level Learning for Large-Scale Black-Box Optimization [13.750841199401613]
We propose to parallelize the well-established covariance matrix adaptation evolution strategy (CMA-ES), in particular one of its latest variants for large-scale optimization (LSO), the limited-memory CMA-ES (LM-CMA).
We present a multilevel learning-based meta-framework for distributed LM-CMA. Owing to its hierarchically organized structure, Meta-ES is well-suited to implement our distributed meta-framework.
arXiv Detail & Related papers (2023-10-09T03:24:51Z)
- Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
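As a side note on the Mini-Sequence Transformer entry above, the chunking idea its summary describes can be illustrated with a short, hedged sketch (PyTorch-style, not MsT's actual implementation): applying a position-wise block over the sequence dimension in mini-sequences bounds the peak intermediate activation size, at the cost of a small loop overhead.

```python
# Hedged sketch of mini-sequence style chunking (illustrative only, not the
# MsT implementation): a position-wise block is applied to the sequence in
# chunks, so that -- in inference mode, or combined with activation
# recomputation during training -- only one chunk's hidden intermediate is
# alive at a time instead of the full (batch, seq_len, hidden_intermediate).
import torch
import torch.nn as nn

def chunked_forward(block: nn.Module, x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Apply `block` over the sequence dimension in mini-sequences.

    `block` must act position-wise (e.g. an MLP or LM head), so splitting
    along seq_len does not change the result.
    """
    outputs = [block(chunk) for chunk in x.split(chunk_size, dim=1)]
    return torch.cat(outputs, dim=1)

if __name__ == "__main__":
    mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    x = torch.randn(2, 8192, 512)
    with torch.no_grad():
        y_full = mlp(x)                                       # full 2x8192x2048 intermediate
        y_chunked = chunked_forward(mlp, x, chunk_size=1024)  # 2x1024x2048 at a time
    assert torch.allclose(y_full, y_chunked, atol=1e-5)
```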