Related papers: Inter-Layer Scheduling Space Exploration for Multi-model Inference on Heterogeneous Chiplets

CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers [72.23291099555459]
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to 2.1x speedup with four cores, improving by 50% over baselines, and 2.9x speedup with eight cores, all without quality degradation.
arXiv Detail & Related papers (2025-07-21T05:48:47Z)
FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [50.438552588818]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z)
On-Device Qwen2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration [1.9965524232168244]
This paper presents an efficient framework for deploying the Qwen2.5-0.5B model on the Xilinx Kria KV260 edge platform. We propose a hybrid execution strategy that intelligently offloads compute-intensive operations to the FPGA while utilizing the CPU for lighter tasks. Our framework achieves a model compression rate of 55.08% compared to the original model and produces output at a rate of 5.1 tokens per second, outperforming the baseline performance of 2.8 tokens per second.
arXiv Detail & Related papers (2025-04-24T08:50:01Z)
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded on two simple observations. Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z)
Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed. It consists of multiple waveguides equipped with numerous low-cost antennas, named pinching antennas (PAs). The positions of PAs can be reconfigured to span both large-scale path and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE. Our results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
Rapid and Power-Aware Learned Optimization for Modular Receive Beamforming [27.09017677987757]
Multiple-input multiple-output (MIMO) systems play a key role in wireless communication technologies. We propose a power-oriented optimization algorithm for beamforming in modular hybrid systems. We show how power-efficient beamforming can be encouraged by the learned optimizer, via boosting computation with low-resolution phase shifts.
arXiv Detail & Related papers (2024-08-01T10:19:25Z)
SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators [12.416683044819955]
Multi-model workloads with heavy models like recent large language models significantly increased the compute and memory demands on hardware. To address such increasing demands, designing a scalable hardware architecture became a key problem. We develop a set of schedulers to navigate the huge scheduling space and codify them into a scheduler, SCAR, with advanced techniques such as inter-chiplet pipelining.
arXiv Detail & Related papers (2024-05-01T18:02:25Z)
Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures [5.46396577345121]
The complexity of transformer models in artificial intelligence expands their computational costs, memory usage, and energy consumption. We propose a novel memory arrangement strategy, governed by the hardware accelerator's kernel size, which effectively minimizes off-chip data access. Our approach can achieve up to a 2.8x speed increase when executing inferences employing state-of-the-art transformers.
arXiv Detail & Related papers (2023-12-20T13:01:25Z)
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation. It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z)
IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction [73.25645602768158]
IPCC-TP is a novel relevance-aware module based on Incremental Pearson Correlation Coefficient to improve multi-agent interaction modeling. Our module can be conveniently embedded into existing multi-agent prediction methods to extend original motion distribution decoders.
arXiv Detail & Related papers (2023-03-01T15:16:56Z)
Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks. In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
Proximal Policy Optimization-based Transmit Beamforming and Phase-shift Design in an IRS-aided ISAC System for the THz Band [90.45915557253385]
An IRS-aided integrated sensing and communications (ISAC) system operating in the terahertz (THz) band is proposed to maximize the system capacity. Transmit beamforming and phase-shift design are transformed into a universal optimization problem with ergodic constraints.
arXiv Detail & Related papers (2022-03-21T09:15:18Z)
Data-Driven Deep Learning Based Hybrid Beamforming for Aerial Massive MIMO-OFDM Systems with Implicit CSI [29.11998008894847]
We propose a data-driven deep learning (DL)-based unified hybrid beamforming framework for time division duplex (TDD) and frequency division duplex (FDD) systems. For TDD systems, the proposed DL-based approach jointly models the uplink pilot combining and downlink hybrid beamforming modules as an end-to-end (E2E) neural network. For FDD systems, we jointly model the downlink pilot transmission, uplink CSI feedback, and downlink hybrid beamforming modules as an E2E neural network.
arXiv Detail & Related papers (2022-01-18T07:21:00Z)
SensiX++: Bringing MLOPs and Multi-tenant Model Serving to Sensory Edge Devices [69.1412199244903]
We present a multi-tenant runtime for adaptive model execution with integrated MLOps on edge devices, e.g., a camera, a microphone, or IoT sensors. SensiX++ operates on two fundamental principles: highly modular componentisation to externalise data operations with clear abstractions, and document-centric manifestation for system-wide orchestration. We report on the overall throughput and quantified benefits of various automation components of SensiX++ and demonstrate its efficacy to significantly reduce operational complexity and lower the effort to deploy, upgrade, reconfigure and serve embedded models on edge devices.
arXiv Detail & Related papers (2021-09-08T22:06:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.