Related papers: STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design

URL: http://arxiv.org/abs/2511.10480v2
Date: Fri, 14 Nov 2025 17:58:00 GMT
Title: STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
Authors: Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna
Abstract summary: Symbolic Tensor grAph GEnerator (STAGE) is a framework that synthesizes high-fidelity execution traces to accurately model workload execution. It supports a comprehensive set of parallelization strategies, allowing users to explore a wide spectrum of LLM architectures and system configurations.
Score: 6.182971013882298
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator (STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available to facilitate further research in distributed machine learning systems: https://github.com/astra-sim/symbolic_tensor_graph

Related papers

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models [78.73992315826035]
We introduce Youtu-LLM, a lightweight language model that harmonizes high computational efficiency with native agentic intelligence. Youtu-LLM is pre-trained from scratch to systematically cultivate reasoning and planning capabilities.
arXiv Detail & Related papers (2025-12-31T04:25:11Z)
AI for Distributed Systems Design: Scalable Cloud Optimization Through Repeated LLMs Sampling And Simulators [3.1594665317979698]
We explore AI-driven distributed-systems policy design by combining code generation from large language models with deterministic verification in a domain-specific simulator. We report preliminary results on throughput improvements across multiple models. We conjecture that AI will be crucial for scaling this methodology by helping to bootstrap new simulators.
arXiv Detail & Related papers (2025-10-20T16:10:24Z)
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective [6.51239603014107]
Large Language Models (LLMs) have pushed training workloads beyond the limits of single-node analysis. We present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms.
arXiv Detail & Related papers (2025-09-12T16:05:07Z)
Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z)
Scaling Intelligence: Designing Data Centers for Next-Gen Language Models [0.6168147650666682]
Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demand a fundamental rethinking of data center architecture. Our work provides a comprehensive co-design framework that jointly explores FLOPS, bandwidth and capacity, and multiple network topologies. We quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, widening the scale-out domain, and increasing memory capacity.
arXiv Detail & Related papers (2025-06-17T22:29:37Z)
Understanding and Optimizing Multi-Stage AI Inference Pipelines [11.254219071373319]
HERMES is a Heterogeneous Multi-stage LLM inference Execution Simulator. Unlike prior frameworks, HERMES supports heterogeneous clients executing multiple models concurrently. We explore the impact of reasoning stages on end-to-end latency, optimal strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval.
arXiv Detail & Related papers (2025-04-14T00:29:49Z)
Low-Rank Adapters Meet Neural Architecture Search for LLM Compression [1.8434042562191815]
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques.
arXiv Detail & Related papers (2025-01-23T02:14:08Z)
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction [52.09472099976885]
IAR is an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID.
arXiv Detail & Related papers (2025-01-01T15:58:51Z)
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z)
SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture, like SOLO, effectively addresses these scalability concerns in LVLMs. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale [5.217665236079274]
We extend the open-source ASTRA-sim infrastructure to model state-of-the-art and emerging distributed training models and platforms. We run comprehensive case studies targeting emerging distributed models and platforms.
arXiv Detail & Related papers (2023-03-24T14:00:18Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.