Related papers: On Abnormal Execution Timing of Conditional Jump Instructions

On Abnormal Execution Timing of Conditional Jump Instructions

URL: http://arxiv.org/abs/2601.11696v1
Date: Fri, 16 Jan 2026 18:30:09 GMT
Title: On Abnormal Execution Timing of Conditional Jump Instructions
Authors: Annika Wilde, Samira Briongos, Claudio Soriente, Ghassan Karame,
Abstract summary: We systematically measure and analyze timing variabilities in conditional jump instructions.<n>Our measurements indicate that these timing variations stem from the micro-op cache placement and the jump's offset in the L1 instruction cache of modern processors.<n>We show that this variability can be exploited as a covert channel, achieving a maximum throughput of 16.14 Mbps.
Score: 5.661457012631801
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: An extensive line of work on modern computing architectures has shown that the execution time of instructions can (i) depend on the operand of the instruction or (ii) be influenced by system optimizations, e.g., branch prediction and speculative execution paradigms. In this paper, we systematically measure and analyze timing variabilities in conditional jump instructions that can be macro-fused with a preceding instruction, depending on their placement within the binary. Our measurements indicate that these timing variations stem from the micro-op cache placement and the jump's offset in the L1 instruction cache of modern processors. We demonstrate that this behavior is consistent across multiple microarchitectures, including Skylake, Coffee Lake, and Kaby Lake, as well as various real-world implementations. We confirm the prevalence of this variability through extensive experiments on a large-scale set of popular binaries, including libraries from Ubuntu 24.04, Windows 10 Pro, and several open-source cryptographic libraries. We also show that one can easily avoid this timing variability by ensuring that macro-fusible instructions are 32-byte aligned - an approach initially suggested in 2019 by Intel in an overlooked short report. We quantify the performance impact of this approach across the cryptographic libraries, showing a speedup of 2.15% on average (and up to 10.54%) when avoiding the timing variability. As a by-product, we show that this variability can be exploited as a covert channel, achieving a maximum throughput of 16.14 Mbps.

Related papers

dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency.<n>Existing open-source models still require nearly token-length decoding steps to ensure performance.<n>We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z)
Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference [31.2331188304598]
Changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses.<n>We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision.<n>Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32.
arXiv Detail & Related papers (2025-06-11T08:23:53Z)
SMaCk: Efficient Instruction Cache Attacks via Self-Modifying Code Conflicts [5.942801930997087]
Self-modifying code (SMC) allows programs to alter their own instructions.<n>SMC introduces unique microarchitectural behaviors that can be exploited for malicious purposes.
arXiv Detail & Related papers (2025-02-08T03:35:55Z)
Discovery of Endianness and Instruction Size Characteristics in Binary Programs from Unknown Instruction Set Architectures [0.0]
We study the problem of streamlining reverse engineering of binary programs from unknown instruction set architectures (ISA) We focus on two fundamental ISA characteristics to beginning the RE process: identification of endianness and whether the instruction width is a fixed or variable. We use bigram-based features for endianness detection and the autocorrelation function, commonly used in signal processing applications, for differentiation between fixed- and variable-width instruction sizes.
arXiv Detail & Related papers (2024-10-28T21:43:53Z)
Breaking Bad: How Compilers Break Constant-Time Implementations [8.771587132463535]
We investigate how compilers break protections introduced by defensive programming techniques.<n>We run a large-scale experiment to see if such compiler-induced issues manifest in state-of-the-art cryptographic libraries.<n>Our study reveals that several compiler-induced secret-dependent operations occur within some of the most highly regarded cryptographic libraries.
arXiv Detail & Related papers (2024-10-17T12:34:02Z)
Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
underlinetextbfPositional textbfIntegrity textbfEncoding (PIE)<n>PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.<n>Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [23.633481089469836]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.<n>We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.<n>Our approach demonstrates up to 2.49$times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding. We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
A Scalable Formal Verification Methodology for Data-Oblivious Hardware [3.518548208712866]
We propose a novel methodology to formally verify data-oblivious behavior in hardware using standard property checking techniques. We show that proving this inductive property is sufficient to exhaustively verify data-obliviousness at the microarchitectural level. One case study uncovered a data-dependent timing violation in the extensively verified and highly secure IBEX RISC-V core.
arXiv Detail & Related papers (2023-08-15T13:19:17Z)
Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain. We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement. We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z)
DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware. Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
A High Performance Compiler for Very Large Scale Surface Code Computations [38.26470870650882]
We present the first high performance compiler for very large scale quantum error correction. It translates an arbitrary quantum circuit to surface code operations based on lattice surgery. The compiler can process millions of gates using a streaming pipeline at a speed geared towards real-time operation of a physical device.
arXiv Detail & Related papers (2023-02-05T19:06:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.