LightCode: Compiling LLM Inference for Photonic-Electronic Systems
- URL: http://arxiv.org/abs/2509.16443v1
- Date: Fri, 19 Sep 2025 21:45:26 GMT
- Title: LightCode: Compiling LLM Inference for Photonic-Electronic Systems
- Authors: Ryan Tomich, Zhizhen Zhong, Dirk Englund
- Abstract summary: LightCode is a compiler framework and simulator for mapping large language models (LLMs) to photonic-electronic systems. We introduce the Stacked Graph, an intermediate representation that encodes hardware-specific realizations of each tensor operation. We show that photonic hardware reduced energy by up to 50% in our simulated workloads at maximum sequence length.
- Score: 0.26068343017240947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing demand for low-latency, energy-efficient inference in large language models (LLMs) has catalyzed interest in heterogeneous architectures. While GPUs remain dominant, they are poorly suited for integration with emerging domain-specific accelerators like Photonic Tensor Units (PTUs), which offer low-power, high-throughput linear computation. This motivates hybrid compilation strategies that combine photonic and electronic resources. We present LightCode, a compiler framework and simulator for mapping LLM inference workloads across hybrid photonic-electronic systems. LightCode introduces the Stacked Graph, an intermediate representation that encodes multiple hardware-specific realizations of each tensor operation. Hardware assignment is formulated as a constrained subgraph selection problem optimized for latency or energy under parametric cost models. We evaluate LightCode on the prefill stage of GPT-2 and Llama-7B, showing that under our workload and hardware assumptions, (i) photonic hardware reduced energy by up to 50% in our simulated workloads at maximum sequence length; (ii) our multiplexing and assignment strategy yielded latency improvements exceeding 10x; and (iii) optimizing for latency or energy resulted in distinct hardware mappings in our simulations. LightCode offers a modular, foundational framework and simulator for compiling LLMs to emerging photonic accelerators.
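As an illustration of the two ideas at the heart of the abstract, a Stacked Graph IR whose nodes carry several hardware realizations and a cost-driven assignment pass, here is a minimal Python sketch. Everything in it is invented for illustration: the class and function names are not LightCode's API, the cost numbers are placeholders, and the greedy per-node choice simplifies the paper's constrained subgraph selection, which also accounts for photonic-electronic conversion costs.

```python
from dataclasses import dataclass

@dataclass
class Realization:
    """One hardware-specific implementation of a tensor op (hypothetical)."""
    hardware: str      # e.g. "PTU" or "GPU"
    latency_s: float   # parametric cost-model estimate
    energy_j: float

@dataclass
class StackedNode:
    """A tensor op together with all of its candidate realizations."""
    name: str
    candidates: list   # list[Realization]

def assign_hardware(graph, objective="latency"):
    """Pick one realization per op, minimizing the chosen objective.

    NOTE: the paper formulates assignment as constrained subgraph
    selection, where choices couple through domain-conversion costs;
    this greedy per-node version only shows the shape of the problem.
    """
    key = (lambda r: r.latency_s) if objective == "latency" else (lambda r: r.energy_j)
    return {node.name: min(node.candidates, key=key) for node in graph}

# Toy prefill-style graph; numbers are made up so the two objectives diverge.
graph = [
    StackedNode("qk_matmul", [Realization("PTU", 8e-6, 0.4e-6),
                              Realization("GPU", 2e-6, 3.0e-6)]),
    StackedNode("softmax", [Realization("GPU", 1e-6, 0.8e-6)]),  # nonlinearity stays electronic
]
for objective in ("latency", "energy"):
    picks = assign_hardware(graph, objective)
    print(objective, "->", {name: r.hardware for name, r in picks.items()})
```

With these toy numbers the two objectives select different hardware for the matmul, the same qualitative effect as finding (iii) above.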
Related papers
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
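For context, ternarization constrains weights to three levels; a minimal NumPy sketch of an asymmetric variant (independent scales for the positive and negative buckets) follows. The threshold heuristic is the generic TWN-style rule, not PT$^2$-LLM's quantizer, and the paper's two-stage refinement is omitted.

```python
import numpy as np

def ternarize_asymmetric(w, delta_ratio=0.7):
    """Map weights to {-a_neg, 0, +a_pos}. Generic sketch, not the paper's rule."""
    delta = delta_ratio * np.abs(w).mean()     # TWN-style threshold heuristic
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    a_pos = w[w > delta].mean() if (w > delta).any() else 0.0     # scale for the +1 bucket
    a_neg = -w[w < -delta].mean() if (w < -delta).any() else 0.0  # scale for the -1 bucket
    return t, a_pos, a_neg

def dequantize(t, a_pos, a_neg):
    return np.where(t > 0, a_pos * t, a_neg * t)

w = np.random.randn(4, 8).astype(np.float32)
t, ap, an = ternarize_asymmetric(w)
print("reconstruction MSE:", float(((dequantize(t, ap, an) - w) ** 2).mean()))
```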
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective [6.51239603014107]
Large Language Models (LLMs) have pushed training workloads beyond the limits of single-node analysis. We present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms.
arXiv Detail & Related papers (2025-09-12T16:05:07Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- What Is Next for LLMs? Next-Generation AI Computing Hardware Using Photonic Chips [34.52960723566363]
Large language models (LLMs) are rapidly pushing the limits of contemporary computing hardware. This review surveys emerging photonic hardware optimized for next-generation generative AI computing.
arXiv Detail & Related papers (2025-05-09T05:19:14Z)
- PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System [13.678531084541666]
We propose PAPI, a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI achieves 1.8$\times$ and 11.1$\times$ speedups over a state-of-the-art heterogeneous accelerator and a state-of-the-art PIM-only accelerator, respectively.
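The compute-bound/memory-bound split can be illustrated with roofline arithmetic: a kernel whose FLOPs-per-byte ratio falls below the machine balance point is memory-bound and favors the PIM side. This is only the intuition behind such dispatch, not PAPI's actual scheduler; the hardware numbers below are placeholders.

```python
def dispatch(kernel_flops, kernel_bytes, peak_flops=1e14, mem_bw=1e12):
    """Route a kernel by arithmetic intensity (roofline intuition only)."""
    machine_balance = peak_flops / mem_bw        # FLOPs per byte at the roofline knee
    intensity = kernel_flops / kernel_bytes
    return "compute_unit" if intensity >= machine_balance else "pim_unit"

# Decode-stage GEMV: ~2*N*D FLOPs over ~2*N*D bytes (fp16) -> intensity ~1, memory-bound.
print(dispatch(kernel_flops=2 * 4096 * 4096, kernel_bytes=2 * 4096 * 4096))
```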
arXiv Detail & Related papers (2025-02-21T13:52:31Z)
- Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs [0.8217552831952]
Large language models (LLMs) have transformed the way we think about language understanding and generation. Group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. We present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions.
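A non-uniform codebook for one weight group can be sketched as 1-D k-means, where codebook entries are data-driven centroids instead of uniformly spaced levels. This is a generic sketch under that assumption; the paper's codebook construction and its optimized Arm kernels are not reproduced.

```python
import numpy as np

def codebook_quantize(group, n_codes=16, iters=10):
    """Fit a 16-entry non-uniform codebook to one weight group (1-D k-means)."""
    codes = np.quantile(group, np.linspace(0, 1, n_codes))  # init on the weight distribution
    for _ in range(iters):
        idx = np.abs(group[:, None] - codes[None, :]).argmin(axis=1)  # nearest centroid
        for k in range(n_codes):
            if (idx == k).any():
                codes[k] = group[idx == k].mean()                     # recenter
    return codes, idx  # per-group codebook + one 4-bit index per weight

group = np.random.randn(128).astype(np.float32)
codes, idx = codebook_quantize(group)
print("reconstruction MSE:", float(((codes[idx] - group) ** 2).mean()))
```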
arXiv Detail & Related papers (2024-12-23T03:44:29Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
The Vision Transformer (ViT) has become one of the most prevalent backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
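For intuition, a plain logarithmic quantizer snaps each value to the nearest power of a fixed base, concentrating resolution near zero. The sketch below uses a fixed base and assumes inputs in [-1, 1]; AdaLog's contribution, adapting the base itself, is deliberately omitted, so this is not the paper's algorithm.

```python
import numpy as np

def log_quantize(x, base=2.0, n_bits=4):
    """Generic log quantizer: snap |x| to the nearest power of `base`.
    Assumes |x| <= 1 (e.g. post-softmax values); not AdaLog's algorithm."""
    max_exp = 2 ** n_bits - 1
    sign = np.sign(x)
    mag = np.maximum(np.abs(x), 1e-12)         # avoid log(0)
    e = np.round(np.log(mag) / np.log(base))   # nearest exponent
    e = np.clip(e, -max_exp, 0)
    return sign * base ** e

x = np.random.uniform(-1.0, 1.0, 8).astype(np.float32)
print(log_quantize(x))
```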
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval [80.96706764868898]
We present a new Low-light Image Enhancement (LLIE) network via Generative LAtent feature based codebook REtrieval (GLARE). We develop a generative Invertible Latent Normalizing Flow (I-LNF) module to align the low-light feature distribution to normal-light latent representations, guaranteeing correct code retrieval in the codebook.
Experiments confirm the superior performance of GLARE on various benchmark datasets and real-world data.
arXiv Detail & Related papers (2024-07-17T09:40:15Z)
- DEAP: Design Space Exploration for DNN Accelerator Parallelism [0.0]
Large Language Models (LLMs) are becoming increasingly complex and powerful, and increasingly demanding to train and serve. This paper showcases how hardware and software co-design can enable the creation of customized hardware systems.
arXiv Detail & Related papers (2023-12-24T02:43:01Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Interleaving: Modular architectures for fault-tolerant photonic quantum computing [50.591267188664666]
Photonic fusion-based quantum computing (FBQC) uses low-loss photonic delays.
We present a modular architecture for FBQC in which these components are combined to form "interleaving modules".
Exploiting the multiplicative power of delays, each module can add thousands of physical qubits to the computational Hilbert space.
arXiv Detail & Related papers (2021-03-15T18:00:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.