An Effectively $\Omega(c)$ Language and Runtime
- URL: http://arxiv.org/abs/2409.20494v1
- Date: Mon, 30 Sep 2024 16:57:45 GMT
- Title: An Effectively $\Omega(c)$ Language and Runtime
- Authors: Mark Marron
- Abstract summary: Good performance of an application is conceptually more of a binary function than a continuous one.
Our vision is to create a language and runtime that is designed to be $\Omega(c)$ in its performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of an application/runtime is usually thought of as a continuous function: the lower the amount of memory/time used on a given workload, the better the compiler/runtime is. In practice, however, good performance of an application is conceptually more of a binary function -- either the application responds in under, say, 100ms, and is fast enough for a user to barely notice, or it takes a noticeable amount of time, leaving the user waiting and potentially abandoning the task. Thus, performance really means how often the application is fast enough to be usable, leading industrial developers to focus on the 95th and 99th percentile latencies as heavily as, or more heavily than, the average response time. Unfortunately, tracking and optimizing for these high-percentile latencies is difficult and often requires a deep understanding of the application, runtime, GC, and OS interactions. This is further complicated by the fact that tail performance is often seen only occasionally, and is specific to a certain workload or input, making these issues uniquely painful to handle. Our vision is to create a language and runtime that is designed to be $\Omega(c)$ in its performance -- that is, it is designed to have an effectively constant time to execute all operations, a constant fixed memory overhead for the application footprint, and a garbage collector that performs a constant amount of work per allocation plus a (small) bounded pause for all collection/release operations.
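The garbage-collection guarantee is the most concrete of the three. A minimal sketch of the idea in Python, under assumptions of our own (reference counting with a deferred-release queue and an invented `RELEASE_BUDGET` constant; the abstract does not specify this design), shows how reclamation work can be capped per allocation so no single operation absorbs an unbounded pause:

```python
from collections import deque

RELEASE_BUDGET = 8  # hypothetical per-allocation work budget

class BoundedPauseHeap:
    """Toy reference-counting heap: reclamation is queued and drained
    at most RELEASE_BUDGET objects per alloc(), so the pause added to
    any single operation is bounded by a constant."""

    def __init__(self):
        self.children = {}      # object -> objects it references
        self.refcount = {}
        self.pending = deque()  # dead objects awaiting release

    def alloc(self, obj, refs=()):
        self._drain(RELEASE_BUDGET)          # constant work, not a full GC
        self.children[obj] = list(refs)
        self.refcount[obj] = 1
        for r in refs:
            self.refcount[r] += 1
        return obj

    def dec_ref(self, obj):
        self.refcount[obj] -= 1
        if self.refcount[obj] == 0:
            self.pending.append(obj)         # defer; do not cascade now

    def _drain(self, budget):
        for _ in range(min(budget, len(self.pending))):
            dead = self.pending.popleft()
            for child in self.children.pop(dead, ()):
                self.dec_ref(child)          # cascades also enter the queue
            del self.refcount[dead]
```

Dead subgraphs are retired a few objects per allocation instead of all at once, converting an unbounded cascade into the constant per-allocation work the abstract describes.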
Related papers
- eWAPA: An eBPF-based WASI Performance Analysis Framework for WebAssembly Runtimes [3.804314901623159]
WebAssembly (Wasm) is a low-level bytecode format that can run in modern browsers.
We propose an eBPF-based WASI performance analysis framework.
It collects key performance metrics of the runtime under different I/O load conditions, such as total execution time, startup time, WASI execution time, and syscall time.
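eWAPA's measurements come from eBPF probes attached in the kernel; as a rough user-space stand-in for the first of those metrics, the Python sketch below times one end-to-end invocation of a Wasm runtime (the `wasmtime` CLI and module name are placeholders; startup, WASI-internal, and syscall timings genuinely require kernel-side instrumentation):

```python
import subprocess
import time

def time_wasm_run(runtime="wasmtime", module="app.wasm"):
    # Coarse, user-space timing of one runtime invocation.
    # runtime/module names are placeholders; syscall- and WASI-level
    # breakdowns (as in eWAPA) require kernel-side eBPF probes.
    t0 = time.perf_counter()
    subprocess.run([runtime, module], check=True)
    return time.perf_counter() - t0
```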
arXiv Detail & Related papers (2024-09-16T13:03:09Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows [0.792324422300924]
We consider ML query processing in distributed systems where GPU-enabled workers coordinate to execute complex queries.
In such systems, coscheduling of GPU memory management and task placement represents a promising opportunity.
We propose Compass, a novel framework that unifies these functions to reduce job latency while using resources efficiently.
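As a generic illustration of why co-scheduling memory residency with placement helps (this is our own toy heuristic, not Compass's decentralized protocol): routing a task to a worker that already holds the model in GPU memory avoids a swap-in on the critical path.

```python
def place_task(task_model, workers):
    # workers: list of dicts {"id", "resident_models": set, "queue_len"}.
    # Prefer a worker with the model already resident in GPU memory
    # (no swap-in), breaking ties by queue length; otherwise fall back
    # to the least-loaded worker.
    warm = [w for w in workers if task_model in w["resident_models"]]
    pool = warm or workers
    best = min(pool, key=lambda w: w["queue_len"])
    return best["id"], bool(warm)

workers = [
    {"id": "w0", "resident_models": {"llama"}, "queue_len": 3},
    {"id": "w1", "resident_models": set(), "queue_len": 0},
]
print(place_task("llama", workers))  # ('w0', True): warm worker wins
```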
arXiv Detail & Related papers (2024-02-27T16:21:28Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
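The one-pass claim follows from the fact that attention over a concatenated KV cache can be split into two partial softmaxes and recombined exactly. A minimal NumPy sketch of that recombination (shapes and names are our own; this illustrates the decomposition rather than the paper's kernel):

```python
import numpy as np

def partial_softmax(scores):
    # Stable partial softmax over one KV segment: returns normalized
    # weights plus the (max, sum-of-exps) needed to merge segments.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    z = e.sum(axis=-1, keepdims=True)
    return e / z, m, z

def relay_attention(q, k_sys, v_sys, k_req, v_req):
    # q: (B, d). k_sys/v_sys: (Ls, d), ONE copy shared by the batch,
    # so its memory traffic is paid once per batch, not once per request.
    # k_req/v_req: (B, Lr, d), ordinary per-request KV.
    scale = q.shape[-1] ** -0.5
    w_s, m_s, z_s = partial_softmax((q @ k_sys.T) * scale)       # (B, Ls)
    o_s = w_s @ v_sys                                            # (B, d)
    w_r, m_r, z_r = partial_softmax(
        np.einsum("bd,bld->bl", q, k_req) * scale)               # (B, Lr)
    o_r = np.einsum("bl,bld->bd", w_r, v_req)                    # (B, d)
    # Exact log-sum-exp merge of the two partial softmaxes.
    m = np.maximum(m_s, m_r)
    a_s, a_r = z_s * np.exp(m_s - m), z_r * np.exp(m_r - m)
    return (a_s * o_s + a_r * o_r) / (a_s + a_r)
```

Because the merge is exact, the output equals dense attention over the concatenated cache; the saving is that `k_sys`/`v_sys` are streamed from DRAM once per batch instead of once per request.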
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
- BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models [77.0501668780182]
Retrieval augmentation addresses many critical problems in large language models.
Running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text.
We introduce binary token representations (BTR), which precompute 1-bit vector representations for every token in the retrieved passages.
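A minimal sketch of what the 1-bit precomputation buys, under our own simplifications (plain sign binarization and XOR+popcount scoring; BTR's calibrated binarization and downstream reader are not shown):

```python
import numpy as np

def binarize(token_vecs):
    # token_vecs: (n_tokens, d) float representations -> packed 1-bit
    # codes of shape (n_tokens, d // 8), a 32x shrink versus fp32.
    return np.packbits(token_vecs > 0, axis=-1)

def hamming_scores(query_code, passage_codes):
    # Popcount of XOR = Hamming distance; lower means more similar.
    xored = np.bitwise_xor(query_code, passage_codes)
    return np.unpackbits(xored, axis=-1).sum(axis=-1)

# Precompute once per corpus, reuse across queries.
passages = binarize(np.random.randn(1000, 128))
query = binarize(np.random.randn(1, 128))
print(hamming_scores(query, passages).argmin())  # nearest passage token
```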
arXiv Detail & Related papers (2023-10-02T16:48:47Z)
- Introducing Language Guidance in Prompt-based Continual Learning [95.03110230754423]
We propose Language Guidance for Prompt-based Continual Learning (LGCL) as a plug-in for prompt-based methods.
LGCL consistently improves the performance of prompt-based continual learning methods, setting a new state of the art.
arXiv Detail & Related papers (2023-08-30T08:03:49Z)
- CHERI Performance Enhancement for a Bytecode Interpreter [0.0]
We show that it is possible to eliminate certain kinds of software-induced runtime overhead that occur due to the larger size of CHERI capabilities (128 bits) relative to native pointers (generally 64 bits).
The worst-case slowdowns are greatly improved, from 100x (before optimization) to 2x (after optimization).
arXiv Detail & Related papers (2023-08-09T17:12:23Z)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction.
Memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
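The attention pattern described above (a causal sliding window plus a few globally visible bridge/memory positions) can be written down directly as a mask. A small NumPy sketch, with the window size and global positions chosen arbitrarily for illustration:

```python
import numpy as np

def longcoder_style_mask(seq_len, window, global_positions):
    # True = attention allowed. Local band: each token sees up to
    # `window` predecessors (causal). Global positions (bridge/memory
    # tokens) both see and are seen by every earlier/later position,
    # still respecting causality.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    local = (j <= i) & (i - j < window)            # causal sliding window
    g = np.zeros(seq_len, dtype=bool)
    g[list(global_positions)] = True
    return local | ((g[None, :] | g[:, None]) & (j <= i))

mask = longcoder_style_mask(seq_len=16, window=4, global_positions=[0, 8])
print(mask.astype(int))
```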
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
- POSET-RL: Phase ordering for Optimizing Size and Execution Time using Reinforcement Learning [0.0]
We present a reinforcement learning based solution to the phase ordering problem.
We propose two approaches to model the sequences: one based on manual ordering, and the other based on a graph called the Oz Dependence Graph (ODG).
arXiv Detail & Related papers (2022-07-27T08:32:23Z)
- GRAPHSPY: Fused Program Semantic-Level Embedding via Graph Neural Networks for Dead Store Detection [4.82596017481926]
We propose a learning-based approach to identify unnecessary memory operations intelligently with low overhead.
By applying several prevalent graph neural network models to extract program semantics, we present a novel, hybrid program embedding approach.
Results show that our model achieves 90% accuracy and incurs only around half the time overhead of the state-of-the-art tool.
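For context on the target problem rather than GRAPHSPY's GNN pipeline: a dead store is a write that is overwritten before any intervening read. A naive trace-based check in Python makes the definition concrete:

```python
def find_dead_stores(trace):
    # trace: list of ("load"|"store", address) events in program order.
    # A store is dead if the same address is stored again before any
    # intervening load of that address.
    last_store = {}   # address -> index of most recent unread store
    dead = []
    for i, (op, addr) in enumerate(trace):
        if op == "store":
            if addr in last_store:
                dead.append(last_store[addr])  # overwritten unread
            last_store[addr] = i
        elif op == "load":
            last_store.pop(addr, None)         # store was read: live
    return dead

trace = [("store", 0x10), ("store", 0x10), ("load", 0x10), ("store", 0x20)]
print(find_dead_stores(trace))  # [0]: first store to 0x10 is dead
```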
arXiv Detail & Related papers (2020-11-18T19:17:15Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
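To make the data-reuse idea concrete (a generic tiling example, not PolyDL's polyhedral algorithm): blocking a matmul loop nest keeps operand tiles resident in cache across iterations, which is exactly the kind of reuse such analyses are built to find.

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    # Blocked loop nest: each (T x T) tile of A and B is reused across
    # a full tile of C -- the data-reuse pattern polyhedral tools model.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for k0 in range(0, k, T):
                C[i0:i0+T, j0:j0+T] += (
                    A[i0:i0+T, k0:k0+T] @ B[k0:k0+T, j0:j0+T])
    return C

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```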
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.