P-MOSS: Learned Scheduling For Indexes Over NUMA Servers Using Low-Level Hardware Statistics
- URL: http://arxiv.org/abs/2411.02933v1
- Date: Tue, 05 Nov 2024 09:23:27 GMT
- Title: P-MOSS: Learned Scheduling For Indexes Over NUMA Servers Using Low-Level Hardware Statistics
- Authors: Yeasir Rayhan, Walid G. Aref
- Abstract summary: This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution on certain logical cores.
Performance results demonstrate that P-MOSS achieves up to a 6x improvement over traditional scheduling in terms of query throughput.
- Score: 3.6985496077087734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ever since Dennard scaling broke down in the early 2000s and CPU frequencies stalled, vendors have increased the core count of each CPU chip at the expense of introducing heterogeneity, ushering in the era of NUMA processors. Since then, heterogeneity in the hardware design space has only increased, to the point that DBMS performance may vary by up to an order of magnitude on modern servers. Important factors that affect performance include the location of the logical cores where DBMS queries are scheduled and the location of the data that the queries access. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution on certain logical cores and places data accordingly at certain integrated memory controllers (IMCs), to integrate hardware consciousness into the system. In the spirit of hardware-software synergy, P-MOSS guides its scheduling decisions solely by low-level hardware statistics collected from performance monitoring counters, with the aid of a Decision Transformer. Experimental evaluation is performed in the context of B-tree and R-tree indexes. Performance results demonstrate that P-MOSS achieves up to a 6x improvement over traditional scheduling in terms of query throughput.
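To make the scheduling idea concrete, below is a minimal, purely illustrative sketch of a P-MOSS-style control loop: per-core statistics are sampled from performance monitoring counters, and a learned policy maps them to a (core, IMC) placement. All names are hypothetical, and the simple heuristic stands in for the paper's offline-trained Decision Transformer.

```python
# Illustrative P-MOSS-style scheduling loop. Helper names and the heuristic
# policy are assumptions for exposition, not the paper's actual API.
from dataclasses import dataclass
import random

@dataclass
class CoreStats:
    core_id: int
    llc_misses: int       # last-level-cache misses from PMU counters
    remote_accesses: int  # cross-NUMA memory accesses
    ipc: float            # instructions retired per cycle

def sample_counters(num_cores: int) -> list[CoreStats]:
    """Stand-in for reading hardware counters (e.g. via perf)."""
    return [CoreStats(c,
                      random.randint(0, 10_000),
                      random.randint(0, 5_000),
                      random.uniform(0.5, 3.0))
            for c in range(num_cores)]

def policy(stats: list[CoreStats]) -> tuple[int, int]:
    """Placeholder for the learned model: prefer cores with high IPC and few
    remote (cross-socket) accesses, and place data on the IMC closest to the
    chosen core (assuming 8 cores share one IMC)."""
    best = max(stats, key=lambda s: s.ipc - 1e-4 * s.remote_accesses)
    return best.core_id, best.core_id // 8

if __name__ == "__main__":
    core, imc = policy(sample_counters(num_cores=32))
    print(f"run next query batch on core {core}; place its data via IMC {imc}")
```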
Related papers
- Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling [0.02091806248191979]
We introduce LIFE, a lightweight and modular analytical framework composed of analytical models of operators.
LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attentions, and operator fusion.
We validate LIFE's forecasts with inference on AMD CPUs, NPUs, iGPUs, and NVIDIA V100 GPUs, with Llama2-7B variants.
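As a rough picture of what an operator-level analytical forecast looks like, here is a hedged roofline-style sketch; the operator list and all numbers are invented, and LIFE's actual operator models are more detailed.

```python
# Roofline-style operator latency model (illustrative stand-in, not LIFE's API):
# each operator is bound by whichever is slower, compute or memory traffic.
def op_latency_ms(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bw: float) -> float:
    return 1e3 * max(flops / peak_flops, bytes_moved / peak_bw)

# Hypothetical per-token decode step of a 7B-parameter model.
ops = [("attention", 2e9, 1.0e8), ("mlp", 8e9, 4.0e8)]
total = sum(op_latency_ms(f, b, peak_flops=1e13, peak_bw=2e11) for _, f, b in ops)
print(f"predicted step latency: {total:.2f} ms")
```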
arXiv Detail & Related papers (2025-07-29T03:08:31Z)
- Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads [2.4012294360291477]
Per-Row Activation Counting (PRAC) modifies key DRAM timing parameters.
PRAC reportedly causes significant performance overheads in simulator-based studies.
We present the first real-machine performance analysis of PRAC.
arXiv Detail & Related papers (2025-07-08T00:38:44Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs).
It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.
It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- Towards On-Device Learning and Reconfigurable Hardware Implementation for Encoded Single-Photon Signal Processing [0.0]
We propose an online training algorithm based on a One-Sided Jacobi rotation-based Online Sequential Extreme Learning Machine (OSOS-ELM).
We fully exploit parallelism in executing OSOS-ELM on a heterogeneous FPGA with integrated ARM cores.
We validate our approach through three case studies involving single-photon signal analysis.
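For context, the textbook OS-ELM recursion that the paper builds on is compact enough to sketch below; the math is the standard recursion with an explicit matrix inverse, whereas OSOS-ELM replaces that inner solve with one-sided Jacobi rotations to suit FPGA hardware.

```python
import numpy as np

def oselm_update(P, beta, H, T):
    """One chunk of the standard OS-ELM recursion:
       P    <- P - P H^T (I + H P H^T)^-1 H P
       beta <- beta + P H^T (T - H beta)
    The paper's variant avoids the explicit inverse via Jacobi rotations."""
    S = np.linalg.inv(np.eye(H.shape[0]) + H @ P @ H.T)
    P = P - P @ H.T @ S @ H @ P
    beta = beta + P @ H.T @ (T - H @ beta)
    return P, beta

# Toy usage: 8 random hidden neurons, scalar target, streaming chunks of 4.
rng = np.random.default_rng(0)
W_in, b = rng.normal(size=(8, 1)), rng.normal(size=8)
hidden = lambda X: np.tanh(X @ W_in.T + b)
X0 = rng.normal(size=(16, 1)); H0 = hidden(X0)
P = np.linalg.inv(H0.T @ H0 + 1e-3 * np.eye(8))  # initial batch solve
beta = P @ H0.T @ np.sin(X0)
for _ in range(5):                               # online chunks
    X = rng.normal(size=(4, 1))
    P, beta = oselm_update(P, beta, hidden(X), np.sin(X))
```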
arXiv Detail & Related papers (2025-04-12T00:58:52Z)
- TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval [10.268774281394261]
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage.
Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments.
We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
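The latency-hiding idea behind lookahead retrieval can be illustrated with a small concurrency sketch (hypothetical helpers, not TeleRAG's interface): the datastore lookup for an anticipated query is launched while the model is still decoding, so retrieval latency hides behind generation.

```python
# Hedged sketch of lookahead retrieval overlap (illustrative helpers only).
from concurrent.futures import ThreadPoolExecutor
import time

def retrieve(query: str) -> list[str]:
    time.sleep(0.3)                    # stand-in for a vector-store lookup
    return [f"doc for {query!r}"]

def generate(prompt: str) -> str:
    time.sleep(0.5)                    # stand-in for LLM decoding
    return prompt.upper()

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(retrieve, "anticipated follow-up query")  # lookahead
    answer = generate("current prompt")   # overlaps with the retrieval above
    docs = future.result()                # usually already finished
    print(answer, docs)
```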
arXiv Detail & Related papers (2025-02-28T11:32:22Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
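As a simplified picture of logarithmic quantization, here is a generic sketch; AdaLog's contribution is making the logarithm base adaptive per layer while keeping dequantization hardware-friendly, which this toy version does not attempt.

```python
import numpy as np

def log_quantize(x, base=2.0, bits=4):
    """Generic logarithmic (non-uniform) quantizer for positive activations,
    a simplified stand-in rather than AdaLog's actual scheme."""
    scale = float(x.max())
    levels = 2 ** bits - 1
    q = np.round(-np.log(np.clip(x / scale, 1e-8, 1.0)) / np.log(base))
    q = np.clip(q, 0, levels)              # integer codes in [0, 2^bits - 1]
    return scale * base ** (-q)            # dequantized approximation

acts = np.random.default_rng(0).random(6)  # e.g. post-softmax attention scores
print(np.round(acts, 3))
print(np.round(log_quantize(acts, base=1.8), 3))
```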
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- Time-Series Forecasting and Sequence Learning Using Memristor-based Reservoir System [2.6473021051027534]
We develop a memristor-based echo state network accelerator that features efficient temporal data processing and in-situ online learning.
The proposed design is benchmarked on various datasets involving real-world tasks, such as forecasting load energy consumption and weather conditions.
It is observed that the system demonstrates reasonable robustness for device failure below 10%, which may occur due to stuck-at faults.
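For reference, the underlying echo state network update is compact enough to sketch in a few lines; this is standard ESN math in software, whereas the paper realizes the reservoir with memristors and adds in-situ online learning.

```python
import numpy as np

# Minimal echo-state-network step (standard ESN dynamics, not the hardware).
rng = np.random.default_rng(0)
N, leak = 100, 0.3
W_in = rng.normal(0, 0.5, (N, 1))
W = rng.normal(0, 1.0, (N, N))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()   # spectral radius < 1

def step(x, u):
    """Leaky-integrator reservoir update; a linear readout trained online
    (e.g. recursive least squares) would sit on top of x."""
    return (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)

x = np.zeros((N, 1))
for t in range(5):                               # drive with a toy signal
    x = step(x, np.array([[np.sin(t)]]))
print(x[:3].ravel())
```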
arXiv Detail & Related papers (2024-05-22T05:07:56Z)
- Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching [59.8522166385372]
Training and inference with graph neural networks (GNNs) on massive graphs has been actively studied since the inception of GNNs.
This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings.
We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data.
arXiv Detail & Related papers (2023-05-04T21:04:01Z)
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling scheme that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
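A toy rendering of the idea follows (illustrative only, not the paper's scheduler): requests advance one backbone segment at a time, and the scheduler re-evaluates priorities at every early-exit boundary, the only points where preemption is allowed.

```python
import heapq

SEGMENT_MS = [5, 5, 5, 5]          # compute cost between consecutive exits

def serve(requests):
    """requests: list of (arrival_ms, deadline_ms, id). Earliest-deadline-first,
    re-evaluated at every exit boundary, so an urgent arrival can preempt."""
    now, ready, done = 0, [], []
    pending = sorted(requests)                     # by arrival time
    progress = {rid: 0 for _, _, rid in requests}  # segments completed
    while pending or ready:
        while pending and pending[0][0] <= now:
            _, dl, rid = pending.pop(0)
            heapq.heappush(ready, (dl, rid))
        if not ready:
            now = pending[0][0]                    # idle until next arrival
            continue
        dl, rid = heapq.heappop(ready)             # EDF pick
        now += SEGMENT_MS[progress[rid]]           # run one segment
        progress[rid] += 1
        if progress[rid] == len(SEGMENT_MS):
            done.append((rid, now))
        else:
            heapq.heappush(ready, (dl, rid))       # preemptible at the exit
    return done

# "b" arrives late with a tight deadline and preempts "a" at an exit boundary.
print(serve([(0, 40, "a"), (3, 12, "b")]))
```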
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
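A hedged sketch of what load-simulated placement can look like (names and cost model are assumptions, not NumS internals): before placing a block operation, simulate the memory and network load it would add to each node, then pick the node that keeps the bottleneck resource most balanced.

```python
# Greedy load-simulated placement of one block operation (illustrative).
MEM_CAP, NET_CAP = 64e9, 10e9      # hypothetical per-node capacities

def place(op_mem, input_locations, nodes):
    """nodes: {name: {"mem": used_bytes, "net": queued_bytes}}.
    input_locations: {block_id: (node, nbytes)} for the op's inputs."""
    best, best_cost = None, float("inf")
    for cand in nodes:
        mem = nodes[cand]["mem"] + op_mem
        # inputs held on other nodes must be transferred: simulated net load
        net = nodes[cand]["net"] + sum(nb for node, nb in input_locations.values()
                                       if node != cand)
        cost = max(mem / MEM_CAP, net / NET_CAP)   # balance the bottleneck
        if cost < best_cost:
            best, best_cost = cand, cost
    return best

nodes = {"n0": {"mem": 10e9, "net": 1e9}, "n1": {"mem": 40e9, "net": 0.0}}
inputs = {"A[0,0]": ("n0", 2e9), "B[0,0]": ("n1", 2e9)}
print(place(op_mem=4e9, input_locations=inputs, nodes=nodes))  # -> "n0"
```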
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- Heterogeneous Data-Centric Architectures for Modern Data-Intensive Applications: Case Studies in Machine Learning and Databases [9.927754948343326]
Processing-in-memory (PIM) is a promising execution paradigm that alleviates the data movement bottleneck in modern applications.
In this paper, we show how to take advantage of the PIM paradigm for two modern data-intensive applications.
arXiv Detail & Related papers (2022-05-29T13:43:17Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
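The split-point decision reduces to a small search, sketched below with invented per-layer numbers: as the channel rate changes, the optimal split moves between fully on-device and fully offloaded execution.

```python
# Illustrative split-point search (not the paper's exact formulation): given
# the current uplink rate, pick the layer after which to hand off to the
# server so that device compute + transfer + server compute is minimized.
def best_split(act_bytes, dev_ms, srv_ms, rate_bytes_per_ms):
    """act_bytes[k] = size of the tensor sent if we split after layer k
    (act_bytes[0] = raw input). Device runs layers 1..k, server the rest."""
    n = len(dev_ms)
    costs = [sum(dev_ms[:k]) + act_bytes[k] / rate_bytes_per_ms + sum(srv_ms[k:])
             for k in range(n + 1)]
    k = min(range(n + 1), key=costs.__getitem__)
    return k, costs[k]

dev = [80, 100, 120, 60]                 # per-layer ms on a slow edge device
srv = [1, 1.5, 2, 1]                     # per-layer ms on the server
act = [600e3, 200e3, 120e3, 50e3, 4e3]   # bytes at each candidate split
for rate in (50, 10_000):                # bytes/ms: slow vs. fast channel
    print(rate, best_split(act, dev, srv, rate))
```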
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
- Combining processing throughput, low latency and timing accuracy in experiment control [0.0]
We ported the firmware of the ARTIQ experiment control infrastructure to an embedded system based on a commercial Xilinx Zynq-7000 system-on-chip.
It contains high-performance hardwired CPU cores integrated with FPGA fabric.
arXiv Detail & Related papers (2021-11-30T11:11:02Z)
- Scheduling in Parallel Finite Buffer Systems: Optimal Decisions under Delayed Feedback [29.177402567437206]
We present a partially observable (PO) model that captures the scheduling decisions in parallel queuing systems under limited information of delayed acknowledgements.
We numerically show that the resulting policy outperforms other limited information scheduling strategies.
We show how our approach can optimize real-time parallel processing using network data provided by Kaggle.
arXiv Detail & Related papers (2021-09-17T13:45:02Z)
- Resistive Neural Hardware Accelerators [0.46198289193451136]
The shift towards ReRAM-based in-memory computing has great potential in the implementation of area- and power-efficient inference.
In this survey, we review the state-of-the-art ReRAM-based Deep Neural Networks (DNNs) many-core accelerators.
arXiv Detail & Related papers (2021-09-08T21:11:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.