SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training
- URL: http://arxiv.org/abs/2603.03592v1
- Date: Tue, 03 Mar 2026 23:51:10 GMT
- Title: SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training
- Authors: Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P Hewa Koneputugodage, Gil Avraham, Yan Zuo, Violetta Shevchenko, Alexander Long
- Abstract summary: Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. We propose SENTINEL, a verification mechanism for pipeline parallelism (PP) training without duplication. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
- Score: 54.8494905524997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While the existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers, and activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recover classical convergence rates when relaxed to standard training. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
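The abstract describes the mechanism but not its implementation. As a rough illustration, the monitoring step can be pictured as an EMA over a cheap per-tensor statistic of each transmitted activation or gradient, with tensors that deviate too far from recent history flagged as suspicious. The sketch below is a minimal, hypothetical Python version under those assumptions; the statistic (size-normalized L2 norm), the variance-scaled threshold, and all names (EMAMonitor, tolerance, warmup) are illustrative and not the authors' implementation.

```python
# Minimal sketch (not the paper's code): EMA-based monitoring of inter-stage
# tensors in pipeline-parallel training. Statistic, threshold rule, and all
# names are illustrative assumptions.
import numpy as np


class EMAMonitor:
    """Tracks an EMA of a scalar statistic of incoming tensors (plus an EMA
    variance) and flags tensors that deviate too far from recent history."""

    def __init__(self, beta=0.99, tolerance=5.0, warmup=50):
        self.beta = beta            # EMA decay factor
        self.tolerance = tolerance  # allowed deviation, in EMA-std units
        self.warmup = warmup        # steps before flagging begins
        self.mean = 0.0
        self.var = 1.0
        self.steps = 0

    @staticmethod
    def _statistic(tensor):
        # Cheap per-tensor summary: L2 norm normalized by tensor size.
        return float(np.linalg.norm(tensor) / np.sqrt(tensor.size))

    def check(self, tensor):
        """Return True if the tensor is consistent with recent history."""
        s = self._statistic(tensor)
        ok = True
        if self.steps >= self.warmup:
            ok = abs(s - self.mean) <= self.tolerance * np.sqrt(self.var)
        # Update the EMA only with accepted tensors so a corrupted sender
        # cannot slowly drag the baseline toward its own values.
        if ok:
            self.mean = self.beta * self.mean + (1 - self.beta) * s
            self.var = self.beta * self.var + (1 - self.beta) * (s - self.mean) ** 2
        self.steps += 1
        return ok


# Usage: one monitor per stage boundary and direction (activations forward,
# gradients backward); reject or re-request tensors that fail the check.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    monitor = EMAMonitor()
    for step in range(200):
        activation = rng.normal(size=(4, 1024)).astype(np.float32)
        if step == 150:                       # simulate a corrupted transmission
            activation *= 100.0
        if not monitor.check(activation):
            print(f"step {step}: flagged suspicious inter-stage tensor")
```

In a real pipeline the same check would run symmetrically on the backward pass, and the response to a flag (drop, re-request, or exclude the sending worker) is a policy choice the abstract does not specify.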
Related papers
- VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training [18.849117699859622]
Training stability is a central challenge in reinforcement learning for large language models. We propose Variational sEquence-level Soft Policy Optimization (VESPO). Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution.
arXiv Detail & Related papers (2026-02-11T09:48:08Z)
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL) and, being gradient-free, offers significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
- Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning [56.47948583452555]
We introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme aligns it with the variational Jordan-Kinderlehrer-Otto principle from optimal transport. SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions. This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages.
arXiv Detail & Related papers (2025-10-17T07:43:51Z)
- Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning [9.245468958723182]
We study distributed data-parallel training of deep neural networks (DNNs) to improve the trade-off between communication efficiency and model performance. We introduce a simple yet effective sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization of DNNs. We show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local methods and gradient averaging.
arXiv Detail & Related papers (2025-07-27T21:49:49Z)
- Local Pairwise Distance Matching for Backpropagation-Free Reinforcement Learning [0.9065034043031668]
Training neural networks with reinforcement learning (RL) typically relies on backpropagation (BP). BP requires storage of activations from the forward pass for subsequent backward updates. We propose a novel approach that trains each layer of the neural network using local signals during the forward pass in RL settings.
arXiv Detail & Related papers (2025-07-15T14:39:41Z)
- TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network [21.231881562816373]
We introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers.
arXiv Detail & Related papers (2025-06-02T06:13:41Z)
- Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z)
- Unsupervised Discovery of Interpretable Directions in h-space of Pre-trained Diffusion Models [63.1637853118899]
We propose the first unsupervised and learning-based method to identify interpretable directions in h-space of pre-trained diffusion models.
We employ a shift control module that works on h-space of pre-trained diffusion models to manipulate a sample into a shifted version of itself.
By jointly optimizing them, the model will spontaneously discover disentangled and interpretable directions.
arXiv Detail & Related papers (2023-10-15T18:44:30Z)
- Distribution Mismatch Correction for Improved Robustness in Deep Neural Networks [86.42889611784855]
Normalization methods increase the vulnerability of deep neural networks to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z)
- HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring [18.8426865970643]
A novel Hierarchical Parallel SGD (HPSGD) strategy is proposed to boost the distributed training process of deep neural networks (DNNs).
Experiments are conducted to demonstrate that the proposed HPSGD approach substantially boosts distributed DNN training, reduces the disturbance of stale gradients, and achieves better accuracy within a given fixed wall-time.
arXiv Detail & Related papers (2020-09-06T10:17:56Z)