ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
- URL: http://arxiv.org/abs/2508.18850v1
- Date: Tue, 26 Aug 2025 09:29:23 GMT
- Title: ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
- Authors: Xinhao Luo, Zihan Liu, Yangjie Zhou, Shihan Fang, Ziyu Huang, Yu Feng, Chen Zhang, Shixuan Sun, Zhenzhe Zheng, Jingwen Leng, Minyi Guo
- Abstract summary: Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators. We introduce two cluster-level communication primitives, ClusterReduce and ClusterGather. We design ClusterFusion, an execution framework that schedules communication and computation jointly to expand the operator fusion scope.
- Score: 38.22906887556149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM) decoding suffers from high latency due to fragmented execution across operators and heavy reliance on off-chip memory for data exchange and reduction. This execution model limits opportunities for fusion and incurs significant memory traffic and kernel launch overhead. While modern architectures such as NVIDIA Hopper provide distributed shared memory and low-latency intra-cluster interconnects, they expose only low-level data movement instructions, lacking structured abstractions for collective on-chip communication. To bridge this software-hardware gap, we introduce two cluster-level communication primitives, ClusterReduce and ClusterGather, which abstract common communication patterns and enable structured, high-speed data exchange and reduction between thread blocks within a cluster, allowing intermediate results to stay on-chip without involving off-chip memory. Building on these abstractions, we design ClusterFusion, an execution framework that schedules communication and computation jointly to expand the operator fusion scope by composing decoding stages such as QKV Projection, Attention, and Output Projection into a single fused kernel. Evaluations on H100 GPUs show that ClusterFusion outperforms state-of-the-art inference frameworks by 1.61x on average in end-to-end latency across different models and configurations. The source code is available at https://github.com/xinhao-luo/ClusterFusion.
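The abstract describes the primitives only at a high level. As a rough illustration of the idea, the following is a minimal sketch of what a ClusterReduce-style on-chip reduction could look like using the public CUDA cooperative-groups cluster API on Hopper (sm_90); the kernel name, `CLUSTER_SIZE`, and the reduction pattern are illustrative assumptions, not the paper's actual interface (see the linked repository for that).

```cuda
// Minimal sketch, assuming the CUDA cooperative-groups cluster API on
// Hopper (sm_90). Names (cluster_reduce, CLUSTER_SIZE) are illustrative,
// not ClusterFusion's actual interface.
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int CLUSTER_SIZE = 4;  // thread blocks per cluster (assumption)

__global__ void __cluster_dims__(CLUSTER_SIZE, 1, 1)
cluster_reduce(const float* in, float* out, int n) {
    __shared__ float partial;  // this block's partial sum
    cg::cluster_group cluster = cg::this_cluster();

    // Step 1: each block reduces its grid-stride slice of the input.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        acc += in[i];
    }
    if (threadIdx.x == 0) partial = 0.0f;
    __syncthreads();
    atomicAdd(&partial, acc);        // simple stand-in for a tree reduction
    cluster.sync();                  // all partials now visible cluster-wide

    // Step 2: ClusterReduce-style combine. Rank 0 reads every peer block's
    // partial directly out of distributed shared memory -- the intermediate
    // values never touch off-chip HBM.
    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned r = 0; r < cluster.num_blocks(); ++r) {
            total += *cluster.map_shared_rank(&partial, r);
        }
        atomicAdd(out, total);       // one off-chip write per cluster
    }
    cluster.sync();  // keep peers' shared memory alive until rank 0 is done
}
```

Launching requires the grid's x-dimension to be a multiple of `CLUSTER_SIZE` and compilation with `nvcc -arch=sm_90`, and `out` must be zeroed beforehand since clusters combine through a global atomic. ClusterGather would presumably follow the same `map_shared_rank` pattern, with each block copying peers' tiles into its own shared memory rather than summing them.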
Related papers
- In-Context Clustering with Large Language Models [50.25868718329313]
ICC captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering.
arXiv Detail & Related papers (2025-10-09T17:07:55Z) - Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency [57.961869351897384]
We propose a framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency. In the first stage, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. In the second stage, we introduce a Self-Enhanced fine-tuning strategy.
arXiv Detail & Related papers (2025-08-02T08:12:57Z) - DiffCLIP: Differential Attention Meets CLIP [57.396578974401734]
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks.
arXiv Detail & Related papers (2025-03-09T14:04:09Z) - FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.5114389643299]
This paper proposes a novel method, Flux, that significantly hides communication latency by overlapping it with dependent computation on GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
arXiv Detail & Related papers (2024-06-11T00:17:39Z) - P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation [8.46409964236009]
Diffusion models and multi-scale features are essential components in semantic segmentation tasks.
We propose a new model for semantic segmentation known as the diffusion model with parallel multi-scale branches.
Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets.
arXiv Detail & Related papers (2024-05-30T19:40:08Z) - V2X-PC: Vehicle-to-everything Collaborative Perception via Point Cluster [58.79477191603844]
We introduce a new message unit, namely point cluster, to represent the scene sparsely with a combination of low-level structure information and high-level semantic information.
This framework includes a Point Cluster Packing (PCP) module to keep object features and manage bandwidth.
Experiments on two widely recognized collaborative perception benchmarks showcase the superior performance of our method compared to the previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-25T11:24:02Z) - One-Step Late Fusion Multi-view Clustering with Compressed Subspace [29.02032034647922]
We propose an integrated framework named One-Step Late Fusion Multi-view Clustering with Compressed Subspace (OS-LFMVC-CS).
We use the consensus subspace to align the partition matrix while optimizing the partition fusion, and utilize the fused partition matrix to guide the learning of discrete labels.
arXiv Detail & Related papers (2024-01-03T06:18:30Z) - A Joint Gradient and Loss Based Clustered Federated Learning Design [26.54703150478879]
A novel clustered FL framework that enables distributed edge devices with non-IID data to independently form several clusters is proposed.
By delegating clustering decisions to edge devices, each device can fully leverage its private data information to determine its own cluster identity.
Simulation results demonstrate that our proposed clustered FL algorithm can reduce clustering iterations by up to 99% compared to the existing baseline.
arXiv Detail & Related papers (2023-11-22T19:39:37Z) - Deep Attention-guided Graph Clustering with Dual Self-supervision [49.040136530379094]
We propose a novel method, namely deep attention-guided graph clustering with dual self-supervision (DAGC).
We develop a dual self-supervision solution consisting of a soft self-supervision strategy with a triplet Kullback-Leibler divergence loss and a hard self-supervision strategy with a pseudo supervision loss.
Our method consistently outperforms state-of-the-art methods on six benchmark datasets.
arXiv Detail & Related papers (2021-11-10T06:53:03Z) - CaEGCN: Cross-Attention Fusion based Enhanced Graph Convolutional Network for Clustering [51.62959830761789]
We propose a cross-attention based deep clustering framework, named Cross-Attention Fusion based Enhanced Graph Convolutional Network (CaEGCN).
CaEGCN contains four main modules: cross-attention fusion, Content Auto-encoder, Graph Convolutional Auto-encoder and self-supervised model.
Experimental results on different types of datasets prove the superiority and robustness of the proposed CaEGCN.
arXiv Detail & Related papers (2021-01-18T05:21:59Z) - A Vertex Cut based Framework for Load Balancing and Parallelism Optimization in Multi-core Systems [15.913119724815733]
High-level applications, such as machine learning, are evolving from simple multilayer-perceptron models for image recognition to much deeper and more complex neural networks for self-driving vehicle control systems.
Parallel programs running on high-performance computers often suffer from data communication bottlenecks, limited memory bandwidth, and synchronization overhead due to irregular critical sections.
We propose a framework to reduce data communication and improve the scalability and performance of these applications in multi-core systems.
arXiv Detail & Related papers (2020-10-09T07:54:28Z)