CHERI Performance Enhancement for a Bytecode Interpreter
- URL: http://arxiv.org/abs/2308.05076v2
- Date: Tue, 12 Sep 2023 20:19:43 GMT
- Title: CHERI Performance Enhancement for a Bytecode Interpreter
- Authors: Duncan Lowther, Dejice Jacob, Jeremy Singer
- Abstract summary: We show that it is possible to eliminate certain kinds of software-induced runtime overhead that occur due to the larger size of CHERI capabilities (128 bits) relative to native pointers (generally 64 bits). The worst-case slowdowns are greatly improved, from 100x (before optimization) to 2x (after optimization).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During our port of the MicroPython bytecode interpreter to the CHERI-based
Arm Morello platform, we encountered a number of serious performance
degradations. This paper explores several of these performance issues in
detail; in each case we characterize the cause of the problem, the fix, and the
corresponding interpreter performance improvement over a set of standard Python
benchmarks.
While we recognize that Morello is a prototypical physical instantiation of
the CHERI concept, we show that it is possible to eliminate certain kinds of
software-induced runtime overhead that occur due to the larger size of CHERI
capabilities (128 bits) relative to native pointers (generally 64 bits). In our
case, we reduce a geometric mean benchmark slowdown from 5x (before
optimization) to 1.7x (after optimization) relative to AArch64, non-capability,
execution. The worst-case slowdowns are greatly improved, from 100x (before
optimization) to 2x (after optimization).
The key insight is that implicit pointer size presuppositions pervade systems
code; whereas previous CHERI porting projects highlighted compile-time and
execution-time errors exposed by pointer size assumptions, we instead focus on
the performance implications of such assumptions.
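To make this concrete, the following is a minimal C sketch (illustrative, not taken from the MicroPython sources) of the kind of implicit pointer-size presupposition at issue: a structure that hard-codes a 64-bit slot for a pointer versus one that lets the compiler size the slot for the platform, which under CHERI keeps 128-bit capabilities aligned and tag-preserving.

```c
#include <stdint.h>

/* BAD: hard-codes the historical "pointers are 64 bits" assumption.
 * On CHERI, a capability is 128 bits plus a validity tag, so forcing
 * one into a uint64_t (via casts or a byte-by-byte copy) loses the
 * tag and pushes later dereferences onto slow recovery paths. */
typedef struct {
    uint64_t value_or_ptr;   /* sized for a classic 64-bit pointer */
} bad_cell_t;

/* BETTER: uintptr_t is pointer-width on every target, 64 bits on
 * plain AArch64 and 128 bits (a capability type) under CHERI, so
 * loads and stores keep capability tags and alignment intact. */
typedef struct {
    uintptr_t value_or_ptr;
} good_cell_t;

void cell_store(good_cell_t *cell, void *p) {
    /* Assigning through a pointer-width type is a tag-preserving
     * capability store; copying the same bytes into plain integer
     * storage would silently strip the tag. */
    cell->value_or_ptr = (uintptr_t)p;
}
```

On conventional AArch64 the two layouts behave identically; under CHERI only the second keeps capability loads and stores on the fast path, which is the distinction behind the slowdowns reported above.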
Related papers
- An Effectively $\Omega(c)$ Language and Runtime [0.0]
Good performance of an application is conceptually more of a binary function.
Our vision is to create a language and runtime that is designed to be $\Omega(c)$ in its performance.
arXiv Detail & Related papers (2024-09-30T16:57:45Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- Should AI Optimize Your Code? A Comparative Study of Current Large Language Models Versus Classical Optimizing Compilers [0.0]
Large Language Models (LLMs) raise intriguing questions about the potential for AI-driven approaches to revolutionize code optimization methodologies.
This paper presents a comparative analysis between two state-of-the-art Large Language Models, GPT-4.0 and CodeLlama-70B, and traditional optimizing compilers.
arXiv Detail & Related papers (2024-06-17T23:26:41Z)
- Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp [0.3749861135832073]
This article optimizes the inference performance of the Qwen-1.8B model by performing Int8 quantization, vectorizing some operators in llama.cpp, and modifying the compilation script.
On the Yitian 710 experimental platform, the prefill performance is increased by 1.6 times, the decoding performance is increased by 24 times, the memory usage is reduced to 1/5 of the original, and the accuracy loss is almost negligible.
arXiv Detail & Related papers (2024-06-16T06:46:25Z)
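As a rough sketch of the Int8 quantization step described in the entry above, the following C function performs symmetric per-block weight quantization in the general spirit of llama.cpp's quantized formats; the block size and names are illustrative, not llama.cpp's actual API.

```c
#include <math.h>
#include <stdint.h>

#define BLOCK 32  /* illustrative block size */

/* Symmetric int8 quantization of one block of float weights:
 * choose a scale so the largest magnitude maps to 127, then
 * round each weight to the nearest int8 value. Dequantization
 * is simply w[i] ~= scale * q[i]. */
float quantize_block_q8(const float w[BLOCK], int8_t q[BLOCK]) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    float inv = (scale != 0.0f) ? 1.0f / scale : 0.0f;
    for (int i = 0; i < BLOCK; i++)
        q[i] = (int8_t)lroundf(w[i] * inv);
    return scale;  /* stored alongside q[] for dequantization */
}
```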
- BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models [77.0501668780182]
Retrieval augmentation addresses many critical problems in large language models.
Running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text.
We introduce binary token representations (BTR), which use 1-bit vectors to precompute every token in passages.
arXiv Detail & Related papers (2023-10-02T16:48:47Z)
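The entry above lends itself to a toy illustration: binarize each dimension of a token vector by sign, then compare vectors with XOR plus popcount, which is the arithmetic that makes 1-bit representations cheap. This is a hedged sketch of the concept, not BTR's actual implementation.

```c
#include <stdint.h>

/* Binarize a 64-dimensional token vector: one bit per dimension,
 * set when the activation is positive. */
uint64_t binarize64(const float v[64]) {
    uint64_t bits = 0;
    for (int i = 0; i < 64; i++)
        if (v[i] > 0.0f)
            bits |= (uint64_t)1 << i;
    return bits;
}

/* Hamming distance between two binarized tokens: one XOR plus a
 * population count (GCC/Clang builtin), versus 64 float
 * multiply-adds for the full-precision dot product. */
int hamming64(uint64_t a, uint64_t b) {
    return __builtin_popcountll(a ^ b);
}
```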
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 pairs of competitive C++ programming submissions, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
- POSET-RL: Phase ordering for Optimizing Size and Execution Time using Reinforcement Learning [0.0]
We present a reinforcement learning based solution to the phase ordering problem.
We propose two approaches to model the sequences: one by manual ordering, and the other based on a graph called the Oz Dependence Graph (ODG).
arXiv Detail & Related papers (2022-07-27T08:32:23Z)
- Learning to Superoptimize Real-world Programs [79.4140991035247]
We propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models.
We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly.
arXiv Detail & Related papers (2021-09-28T05:33:21Z)
- Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization [8.404254529115835]
A common pain point in differentially private machine learning is the significant runtime overhead incurred when executing Differentially Private Stochastic Gradient Descent (DPSGD).
We demonstrate that by exploiting powerful language primitives, one can dramatically reduce these overheads, in many cases nearly matching the best non-private running times.
arXiv Detail & Related papers (2020-10-18T18:45:04Z)
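To ground the overhead being discussed, here is a small self-contained C sketch of the per-example work that dominates a DPSGD step: clip each example's gradient to a fixed L2 norm, sum, add Gaussian noise, and average. Vectorizing and JIT-compiling exactly this loop, rather than iterating per example in interpreted code, is the kind of speedup the entry refers to; all names here are illustrative.

```c
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy Box-Muller Gaussian sampler over rand(); a real DPSGD
 * implementation would use a cryptographically secure RNG. */
static double gaussian(double sigma) {
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sigma * sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
}

/* One DPSGD aggregation step over a batch of n per-example
 * gradients g[i] of dimension dim: clip each to L2 norm at most
 * clip_c, sum, add noise scaled by sigma * clip_c, then average. */
void dpsgd_step(double **g, size_t n, size_t dim,
                double clip_c, double sigma, double *out) {
    for (size_t d = 0; d < dim; d++) out[d] = 0.0;
    for (size_t i = 0; i < n; i++) {
        double norm = 0.0;
        for (size_t d = 0; d < dim; d++) norm += g[i][d] * g[i][d];
        norm = sqrt(norm);
        double s = (norm > clip_c) ? clip_c / norm : 1.0;  /* clip factor */
        for (size_t d = 0; d < dim; d++) out[d] += s * g[i][d];
    }
    for (size_t d = 0; d < dim; d++)
        out[d] = (out[d] + gaussian(sigma * clip_c)) / (double)n;
}
```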
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We find the best model structure of BERT for a given computation size to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)