Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads
- URL: http://arxiv.org/abs/2507.05556v1
- Date: Tue, 08 Jul 2025 00:38:44 GMT
- Title: Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads
- Authors: Jumin Kim, Seungmin Baek, Minbok Wi, Hwayong Nam, Michael Jaemin Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn
- Abstract summary: Per-Row Activation Counting (PRAC) modifies key DRAM timing parameters. PRAC reportedly causes significant performance overheads in simulator-based studies. We present the first real-machine performance analysis of PRAC.
- Score: 2.4012294360291477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Per-Row Activation Counting (PRAC), a DRAM read disturbance mitigation method, modifies key DRAM timing parameters, reportedly causing significant performance overheads in simulator-based studies. However, given known discrepancies between simulators and real hardware, real-machine experiments are vital for accurate PRAC performance estimation. We present the first real-machine performance analysis of PRAC. After verifying timing modifications on the latest CPUs using microbenchmarks, our analysis shows that PRAC's average and maximum overheads are just 1.06% and 3.28% for the SPEC CPU2017 workloads -- up to 9.15x lower than simulator-based reports. Further, we show that the close page policy minimizes this overhead by effectively hiding the elongated DRAM row precharge operations due to PRAC from the critical path.
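To see why a close-page policy can hide PRAC's elongated precharge, consider a back-of-envelope latency model; the timing values below (tRCD, CL, tRP) are illustrative DDR5-like numbers assumed for this sketch, not figures from the paper:

```python
# Back-of-envelope model of PRAC's impact on average DRAM access latency.
# All timing values (ns) are illustrative assumptions, NOT from the paper.
TRCD, CL = 16.0, 16.0             # activate-to-read delay, read latency
TRP_BASE, TRP_PRAC = 16.0, 52.0   # precharge time without / with PRAC (assumed)

def open_page_latency(hit_rate, trp):
    # Open-page: a row-buffer miss pays precharge + activate + read on the
    # critical path; a row-buffer hit pays only the read latency.
    return hit_rate * CL + (1 - hit_rate) * (trp + TRCD + CL)

def closed_page_latency(gap_ns, trp):
    # Closed-page: the row is precharged right after each access, so the
    # (elongated) precharge overlaps the idle gap before the next request.
    exposed_trp = max(0.0, trp - gap_ns)  # only the un-hidden part is paid
    return exposed_trp + TRCD + CL

for policy, base, prac in [
    ("open-page, 30% row hits",
     open_page_latency(0.3, TRP_BASE), open_page_latency(0.3, TRP_PRAC)),
    ("closed-page, 60 ns request gaps",
     closed_page_latency(60.0, TRP_BASE), closed_page_latency(60.0, TRP_PRAC)),
]:
    print(f"{policy}: {base:.1f} -> {prac:.1f} ns (+{100*(prac-base)/base:.1f}%)")
```

With these assumed numbers, the open-page policy exposes the longer tRP on every row miss, while the closed-page policy hides it entirely whenever the inter-request gap exceeds tRP.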
Related papers
- Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation [4.573673188291683]
We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator. We optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
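As a rough illustration of the capture-and-replay idea (the op names, byte counts, and latency models below are invented stand-ins, not xPU-Shark's actual trace format):

```python
# Replay a captured op trace through a timing model instead of re-running
# the workload; swapping the model enables "what-if" experiments.
trace = [  # hypothetical machine-level ops recorded in production
    {"op": "matmul", "bytes": 1 << 20},
    {"op": "all_reduce", "bytes": 1 << 22},
    {"op": "matmul", "bytes": 1 << 20},
]

LATENCY = {"matmul": lambda b: b / 512, "all_reduce": lambda b: b / 64}

def replay(trace, model):
    return sum(model[e["op"]](e["bytes"]) for e in trace)

baseline = replay(trace, LATENCY)
# What-if: a 2x faster collective, modeled as a simulator modification.
tuned = replay(trace, {**LATENCY, "all_reduce": lambda b: b / 128})
print(f"baseline {baseline:.0f}, tuned collective {tuned:.0f} (model time units)")
```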
arXiv Detail & Related papers (2025-03-18T23:15:02Z)
- Value-Based Deep RL Scales Predictably [100.21834069400023]
We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym.
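A toy version of what that predictability enables: fit a power law to small-budget runs and extrapolate performance to a larger budget (all numbers below are synthetic):

```python
import numpy as np

# Fit perf ~ a * C^b in log-log space from cheap small-scale runs, then
# extrapolate to a budget that was never trained. Data is synthetic.
compute = np.array([1e16, 2e16, 4e16, 8e16])   # training FLOPs
perf = np.array([310.0, 420.0, 560.0, 748.0])  # e.g., average return

slope, intercept = np.polyfit(np.log(compute), np.log(perf), 1)
predict = lambda c: np.exp(intercept) * c ** slope
print(f"perf ~ C^{slope:.2f}; predicted at 3.2e17 FLOPs: {predict(3.2e17):.0f}")
```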
arXiv Detail & Related papers (2025-02-06T18:59:47Z)
- P-MOSS: Learned Scheduling For Indexes Over NUMA Servers Using Low-Level Hardware Statistics [3.6985496077087734]
This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution onto specific logical cores.
Performance results demonstrate that P-MOSS has up to 6x improvement over traditional schedules in terms of query throughput.
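A minimal sketch of the spatial-scheduling mechanism on Linux, with a hypothetical `pick_cores` heuristic standing in for P-MOSS's learned policy:

```python
import os

def pick_cores(hw_stats):
    # Stand-in for the learned model: prefer the socket with less memory-
    # bandwidth pressure. Stats and core IDs are illustrative.
    less_loaded = hw_stats["bw_socket0"] <= hw_stats["bw_socket1"]
    return {0, 1, 2, 3} if less_loaded else {4, 5, 6, 7}

def run_query(query, hw_stats):
    os.sched_setaffinity(0, pick_cores(hw_stats))  # Linux-only syscall wrapper
    return query()  # index operations now execute on the chosen cores

result = run_query(lambda: sum(range(1_000)),
                   {"bw_socket0": 0.4, "bw_socket1": 0.9})
```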
arXiv Detail & Related papers (2024-11-05T09:23:27Z)
- PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation [68.17081518640934]
We propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R).
PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module.
Our PIVOT-R outperforms state-of-the-art open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks.
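A structural sketch of that two-module split (class names and interfaces are invented; the real modules are neural networks running at different frequencies):

```python
class WaypointWorldModel:      # stand-in for WAWM: predicts the next sub-goal
    def predict_waypoint(self, obs, instruction):
        return {"target": "above_cup"}   # illustrative key state

class ActionHead:              # lightweight action prediction module
    def act(self, obs, waypoint):
        return [0.0, 0.1, -0.05]         # illustrative low-level command

wawm, head, waypoint = WaypointWorldModel(), ActionHead(), None
for step in range(10):
    obs = step                 # placeholder observation
    if step % 5 == 0:          # the world model runs at a lower frequency
        waypoint = wawm.predict_waypoint(obs, "pick up the cup")
    action = head.act(obs, waypoint)  # the action head runs every step
```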
arXiv Detail & Related papers (2024-10-14T11:30:18Z)
- Understanding the Security Benefits and Overheads of Emerging Industry Solutions to DRAM Read Disturbance [6.637143975465625]
The Per Row Activation Counting (PRAC) mitigation method is described in the JEDEC DDR5 specification's April 2024 update.
A back-off signal propagates from the DRAM chip to the memory controller.
RFM commands are issued only when needed, as opposed to periodically, reducing RFM's overheads.
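A toy model of that reactive flow, with an assumed alert threshold and a simplified one-row mitigation:

```python
from collections import defaultdict

ALERT_THRESHOLD = 512  # assumed; real chips track vendor-specific thresholds

class DramChip:
    def __init__(self):
        self.counts = defaultdict(int)  # per-row activation counters

    def activate(self, row):
        self.counts[row] += 1
        return self.counts[row] >= ALERT_THRESHOLD  # assert back-off?

    def rfm(self):  # mitigate the hottest row, then clear its counter
        hot = max(self.counts, key=self.counts.get)
        self.counts[hot] = 0

chip, rfms = DramChip(), 0
for i in range(10_000):            # hammer a few rows
    if chip.activate(row=i % 3):   # back-off reaches the memory controller
        chip.rfm()                 # controller issues RFM only when signaled
        rfms += 1
print(f"RFM commands issued on demand: {rfms}")
```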
arXiv Detail & Related papers (2024-06-27T11:22:46Z)
- Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models in environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which serve as standard formats for serializing models and deploying them across different runtime infrastructures.
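A typical interchange workflow, sketched with PyTorch's ONNX exporter and onnxruntime (the model and shapes are placeholders):

```python
import torch
import onnxruntime as ort

# Export a trained model once, then run it under a different runtime.
model = torch.nn.Linear(4, 2).eval()
example = torch.randn(1, 4)
torch.onnx.export(model, example, "model.onnx",
                  input_names=["x"], output_names=["y"])

session = ort.InferenceSession("model.onnx",
                               providers=["CPUExecutionProvider"])
(out,) = session.run(None, {"x": example.numpy()})
print(out.shape)  # (1, 2)
```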
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost-matching pretraining and downstream accuracy, and speeds up inference by $1.47\times$ on a single TPUv5e device.
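A minimal NumPy sketch of the two-stage idea; casting to float16 and a 4x over-selection factor stand in for HiRE's actual compression scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16
W = rng.standard_normal((20_000, 512)).astype(np.float32)
x = rng.standard_normal(512).astype(np.float32)

# Stage 1: cheap low-precision scores predict a high-recall candidate set.
approx = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)
candidates = np.argpartition(approx, -4 * k)[-4 * k:]

# Stage 2: exact computation restricted to the predicted subset.
exact = W[candidates] @ x
topk = candidates[np.argsort(exact)[-k:]]

true_topk = np.argsort(W @ x)[-k:]
print(f"recall vs exact top-{k}: {len(np.intersect1d(topk, true_topk)) / k:.2f}")
```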
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors [44.5740422079]
We show that pretraining with standard denoising objectives leads to dramatic gains across multiple architectures.
In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained.
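A minimal sketch of such a denoising objective in PyTorch: corrupt a fraction of the tokens and train the model to reconstruct them (the model size, 15% mask rate, and use of token 0 as the mask are placeholder choices):

```python
import torch

vocab, d, seq = 1000, 64, 128
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, d),
    torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d, nhead=4, batch_first=True),
        num_layers=2),
    torch.nn.Linear(d, vocab),
)

tokens = torch.randint(1, vocab, (8, seq))       # synthetic "pretraining" batch
mask = torch.rand(tokens.shape) < 0.15           # corrupt 15% of positions
corrupted = tokens.masked_fill(mask, 0)          # token 0 acts as [MASK]

logits = model(corrupted)
loss = torch.nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()                                  # one denoising step
```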
arXiv Detail & Related papers (2023-10-04T17:17:06Z)
- Re-Evaluating LiDAR Scene Flow for Autonomous Driving [80.37947791534985]
Popular benchmarks for self-supervised LiDAR scene flow have unrealistic rates of dynamic motion, unrealistic correspondences, and unrealistic sampling patterns.
We evaluate a suite of top methods on real-world datasets.
We show that despite the emphasis placed on learning, most performance gains are caused by pre- and post-processing steps.
arXiv Detail & Related papers (2023-04-04T22:45:50Z)
- Learning to Rank Graph-based Application Objects on Heterogeneous Memories [0.0]
This paper describes a methodology for identifying and characterizing application objects that have the most influence on the application's performance.
By performing data placement using our predictive model, we can reduce execution time degradation by 12% on average and up to 30% compared to the baseline approach.
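A sketch of influence-ranked placement, with hypothetical object statistics and a greedy density heuristic standing in for the paper's predictive model:

```python
# (name, size in GB, predicted influence on performance) -- all invented
objects = [
    ("graph_edges", 24, 0.90),
    ("vertex_props", 8, 0.55),
    ("work_queue", 2, 0.80),
    ("scratch", 12, 0.10),
]
FAST_TIER_GB = 16  # e.g., DRAM; everything else spills to slower memory

placement, used = {}, 0
# Greedy: highest influence per GB goes to the fast tier first.
for name, size, score in sorted(objects, key=lambda o: o[2] / o[1], reverse=True):
    if used + size <= FAST_TIER_GB:
        placement[name], used = "fast", used + size
    else:
        placement[name] = "slow"
print(placement)  # {'work_queue': 'fast', 'vertex_props': 'fast', ...}
```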
arXiv Detail & Related papers (2022-11-04T00:20:31Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge-device-oriented extension of MAPLE, the state-of-the-art latency predictor for general-purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
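As a sketch of the idea, a plain linear regressor over a few synthetic counters stands in for MAPLE-Edge's learned predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
# Columns: instructions, cache misses, branch misses (scaled, synthetic).
counters = rng.random((40, 3))
latency_ms = counters @ np.array([5.0, 12.0, 3.0]) + 0.1 * rng.random(40)

coef, *_ = np.linalg.lstsq(counters, latency_ms, rcond=None)
unseen = np.array([0.6, 0.2, 0.1])  # counters collected on a new device
print(f"predicted latency: {unseen @ coef:.2f} ms")
```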
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- ATRIA: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing [0.5257115841810257]
ATRIA is a novel bit-pArallel sTochastic aRithmetic based In-DRAM Accelerator for high-speed inference of CNNs.
We show that ATRIA exhibits only a 3.5% drop in CNN inference accuracy while still achieving improvements of up to 3.2x in frames-per-second (FPS) and up to 10x in efficiency.
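The underlying stochastic-computing primitive is easy to sketch: encode values in [0, 1] as random bitstreams, and multiplication becomes a bitwise AND (the bitstream length is arbitrary; ATRIA's bit-parallel in-DRAM design is far more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096  # bitstream length; longer streams reduce the estimation error

def to_stream(p):
    # Unipolar encoding: each bit is 1 with probability p.
    return rng.random(N) < p

a, b = 0.75, 0.40
product = to_stream(a) & to_stream(b)  # AND of independent streams multiplies
print(f"{a} * {b} = {a * b}; stochastic estimate = {product.mean():.3f}")
```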
arXiv Detail & Related papers (2021-05-26T18:36:01Z)