An Effectively $\Omega(c)$ Language and Runtime
- URL: http://arxiv.org/abs/2409.20494v1
- Date: Mon, 30 Sep 2024 16:57:45 GMT
- Title: An Effectively $\Omega(c)$ Language and Runtime
- Authors: Mark Marron
- Abstract summary: Good performance of an application is conceptually more of a binary function than a continuous one.
Our vision is to create a language and runtime that is designed to be $\Omega(c)$ in its performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of an application/runtime is usually thought of as a continuous function: the lower the amount of memory/time used on a given workload, the better the compiler/runtime is. In practice, however, good performance of an application is conceptually more of a binary function -- either the application responds in under, say, 100ms, and is fast enough for a user to barely notice, or it takes a noticeable amount of time, leaving the user waiting and potentially abandoning the task. Thus, performance really means how often the application is fast enough to be usable, leading industrial developers to focus on the 95th and 99th percentile latencies as heavily as, or more heavily than, the average response time. Unfortunately, tracking and optimizing for these high-percentile latencies is difficult and often requires a deep understanding of the application, runtime, GC, and OS interactions. This is further complicated by the fact that tail performance is often seen only occasionally, and is specific to a certain workload or input, making these issues uniquely painful to handle. Our vision is to create a language and runtime that is designed to be $\Omega(c)$ in its performance -- that is, it is designed to have an effectively constant time to execute all operations, a constant fixed memory overhead for the application footprint, and a garbage collector that performs a constant amount of work per allocation plus a (small) bounded pause for all collection/release operations.
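The garbage-collection guarantee is the most concrete of the three. A minimal sketch of the idea in Python, under assumptions of our own (reference counting with a deferred-release queue and an invented `RELEASE_BUDGET` constant; the abstract does not specify this design), shows how reclamation work can be capped per allocation so no single operation absorbs an unbounded pause:

```python
from collections import deque

RELEASE_BUDGET = 8  # hypothetical per-allocation work budget

class BoundedPauseHeap:
    """Toy reference-counting heap: reclamation is queued and drained
    at most RELEASE_BUDGET objects per alloc(), so the pause added to
    any single operation is bounded by a constant."""

    def __init__(self):
        self.children = {}      # object -> objects it references
        self.refcount = {}
        self.pending = deque()  # dead objects awaiting release

    def alloc(self, obj, refs=()):
        self._drain(RELEASE_BUDGET)          # constant work, not a full GC
        self.children[obj] = list(refs)
        self.refcount[obj] = 1
        for r in refs:
            self.refcount[r] += 1
        return obj

    def dec_ref(self, obj):
        self.refcount[obj] -= 1
        if self.refcount[obj] == 0:
            self.pending.append(obj)         # defer; do not cascade now

    def _drain(self, budget):
        for _ in range(min(budget, len(self.pending))):
            dead = self.pending.popleft()
            for child in self.children.pop(dead, ()):
                self.dec_ref(child)          # cascades also enter the queue
            del self.refcount[dead]
```

Dead subgraphs are retired a few objects per allocation instead of all at once, converting an unbounded cascade into the constant per-allocation work the abstract describes.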
Related papers
- eWAPA: An eBPF-based WASI Performance Analysis Framework for WebAssembly Runtimes [3.804314901623159]
WebAssembly (Wasm) is a low-level bytecode format that can run in modern browsers.
We propose an eBPF-based WASI performance analysis framework.
It collects key performance metrics of the runtime under different I/O load conditions, such as total execution time, startup time, WASI execution time, and syscall time.
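eWAPA's measurements come from eBPF probes attached in the kernel; as a rough user-space stand-in for the first of those metrics, the Python sketch below times one end-to-end invocation of a Wasm runtime (the `wasmtime` CLI and module name are placeholders; startup, WASI-internal, and syscall timings genuinely require kernel-side instrumentation):

```python
import subprocess
import time

def time_wasm_run(runtime="wasmtime", module="app.wasm"):
    # Coarse, user-space timing of one runtime invocation.
    # runtime/module names are placeholders; syscall- and WASI-level
    # breakdowns (as in eWAPA) require kernel-side eBPF probes.
    t0 = time.perf_counter()
    subprocess.run([runtime, module], check=True)
    return time.perf_counter() - t0
```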
arXiv Detail & Related papers (2024-09-16T13:03:09Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows [0.792324422300924]
We consider ML query processing in distributed systems where GPU-enabled workers coordinate to execute complex queries.
In such systems, coscheduling of GPU memory management and task placement represents a promising opportunity.
We propose Compass, a novel framework that unifies these functions to reduce job latency while using resources efficiently.
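As a generic illustration of why co-scheduling memory residency with placement helps (this is our own toy heuristic, not Compass's decentralized protocol): routing a task to a worker that already holds the model in GPU memory avoids a swap-in on the critical path.

```python
def place_task(task_model, workers):
    # workers: list of dicts {"id", "resident_models": set, "queue_len"}.
    # Prefer a worker with the model already resident in GPU memory
    # (no swap-in), breaking ties by queue length; otherwise fall back
    # to the least-loaded worker.
    warm = [w for w in workers if task_model in w["resident_models"]]
    pool = warm or workers
    best = min(pool, key=lambda w: w["queue_len"])
    return best["id"], bool(warm)

workers = [
    {"id": "w0", "resident_models": {"llama"}, "queue_len": 3},
    {"id": "w1", "resident_models": set(), "queue_len": 0},
]
print(place_task("llama", workers))  # ('w0', True): warm worker wins
```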
arXiv Detail & Related papers (2024-02-27T16:21:28Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
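The one-pass claim follows from the fact that attention over a concatenated KV cache can be split into two partial softmaxes and recombined exactly. A minimal NumPy sketch of that recombination (shapes and names are our own; this illustrates the decomposition rather than the paper's kernel):

```python
import numpy as np

def partial_softmax(scores):
    # Stable partial softmax over one KV segment: returns normalized
    # weights plus the (max, sum-of-exps) needed to merge segments.
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    z = e.sum(axis=-1, keepdims=True)
    return e / z, m, z

def relay_attention(q, k_sys, v_sys, k_req, v_req):
    # q: (B, d). k_sys/v_sys: (Ls, d), ONE copy shared by the batch,
    # so its memory traffic is paid once per batch, not once per request.
    # k_req/v_req: (B, Lr, d), ordinary per-request KV.
    scale = q.shape[-1] ** -0.5
    w_s, m_s, z_s = partial_softmax((q @ k_sys.T) * scale)       # (B, Ls)
    o_s = w_s @ v_sys                                            # (B, d)
    w_r, m_r, z_r = partial_softmax(
        np.einsum("bd,bld->bl", q, k_req) * scale)               # (B, Lr)
    o_r = np.einsum("bl,bld->bd", w_r, v_req)                    # (B, d)
    # Exact log-sum-exp merge of the two partial softmaxes.
    m = np.maximum(m_s, m_r)
    a_s, a_r = z_s * np.exp(m_s - m), z_r * np.exp(m_r - m)
    return (a_s * o_s + a_r * o_r) / (a_s + a_r)
```

Because the merge is exact, the output equals dense attention over the concatenated cache; the saving is that `k_sys`/`v_sys` are streamed from DRAM once per batch instead of once per request.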
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
- BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models [77.0501668780182]
Retrieval augmentation addresses many critical problems in large language models.
Running retrieval-augmented language models (LMs) is slow and difficult to scale due to processing large amounts of retrieved text.
We introduce binary token representations (BTR), which precompute 1-bit vector representations for every token in the retrieved passages.
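A minimal sketch of what the 1-bit precomputation buys, under our own simplifications (plain sign binarization and XOR+popcount scoring; BTR's calibrated binarization and downstream reader are not shown):

```python
import numpy as np

def binarize(token_vecs):
    # token_vecs: (n_tokens, d) float representations -> packed 1-bit
    # codes of shape (n_tokens, d // 8), a 32x shrink versus fp32.
    return np.packbits(token_vecs > 0, axis=-1)

def hamming_scores(query_code, passage_codes):
    # Popcount of XOR = Hamming distance; lower means more similar.
    xored = np.bitwise_xor(query_code, passage_codes)
    return np.unpackbits(xored, axis=-1).sum(axis=-1)

# Precompute once per corpus, reuse across queries.
passages = binarize(np.random.randn(1000, 128))
query = binarize(np.random.randn(1, 128))
print(hamming_scores(query, passages).argmin())  # nearest passage token
```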
arXiv Detail & Related papers (2023-10-02T16:48:47Z)
- Introducing Language Guidance in Prompt-based Continual Learning [95.03110230754423]
We propose Language Guidance for Prompt-based Continual Learning (LGCL) as a plug-in for prompt-based methods.
LGCL consistently improves the performance of prompt-based continual learning methods, setting a new state of the art.
arXiv Detail & Related papers (2023-08-30T08:03:49Z)
- CHERI Performance Enhancement for a Bytecode Interpreter [0.0]
We show that it is possible to eliminate certain kinds of software-induced runtime overhead that occur due to the larger size of CHERI capabilities (128 bits) relative to native pointers (generally 64 bits).
The worst-case slowdowns are greatly improved, from 100x (before optimization) to 2x (after optimization).
arXiv Detail & Related papers (2023-08-09T17:12:23Z)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction.
Memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
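The attention pattern described above (a causal sliding window plus a few globally visible bridge/memory positions) can be written down directly as a mask. A small NumPy sketch, with the window size and global positions chosen arbitrarily for illustration:

```python
import numpy as np

def longcoder_style_mask(seq_len, window, global_positions):
    # True = attention allowed. Local band: each token sees up to
    # `window` predecessors (causal). Global positions (bridge/memory
    # tokens) both see and are seen by every earlier/later position,
    # still respecting causality.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    local = (j <= i) & (i - j < window)            # causal sliding window
    g = np.zeros(seq_len, dtype=bool)
    g[list(global_positions)] = True
    return local | ((g[None, :] | g[:, None]) & (j <= i))

mask = longcoder_style_mask(seq_len=16, window=4, global_positions=[0, 8])
print(mask.astype(int))
```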
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
- POSET-RL: Phase ordering for Optimizing Size and Execution Time using Reinforcement Learning [0.0]
We present a reinforcement learning based solution to the phase ordering problem.
We propose two approaches to model the sequences: one based on manual ordering, and the other based on a graph called the Oz Dependence Graph (ODG).
arXiv Detail & Related papers (2022-07-27T08:32:23Z)
- GRAPHSPY: Fused Program Semantic-Level Embedding via Graph Neural Networks for Dead Store Detection [4.82596017481926]
We propose a learning-based approach to identify unnecessary memory operations intelligently with low overhead.
By applying several prevalent graph neural network models to extract program semantics, we present a novel, hybrid program embedding approach.
Results show that our model achieves 90% accuracy and incurs only around half the time overhead of the state-of-the-art tool.
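For context on the target problem rather than GRAPHSPY's GNN pipeline: a dead store is a write that is overwritten before any intervening read. A naive trace-based check in Python makes the definition concrete:

```python
def find_dead_stores(trace):
    # trace: list of ("load"|"store", address) events in program order.
    # A store is dead if the same address is stored again before any
    # intervening load of that address.
    last_store = {}   # address -> index of most recent unread store
    dead = []
    for i, (op, addr) in enumerate(trace):
        if op == "store":
            if addr in last_store:
                dead.append(last_store[addr])  # overwritten unread
            last_store[addr] = i
        elif op == "load":
            last_store.pop(addr, None)         # store was read: live
    return dead

trace = [("store", 0x10), ("store", 0x10), ("load", 0x10), ("store", 0x20)]
print(find_dead_stores(trace))  # [0]: first store to 0x10 is dead
```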
arXiv Detail & Related papers (2020-11-18T19:17:15Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
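To make the data-reuse idea concrete (a generic tiling example, not PolyDL's polyhedral algorithm): blocking a matmul loop nest keeps operand tiles resident in cache across iterations, which is exactly the kind of reuse such analyses are built to find.

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    # Blocked loop nest: each (T x T) tile of A and B is reused across
    # a full tile of C -- the data-reuse pattern polyhedral tools model.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for k0 in range(0, k, T):
                C[i0:i0+T, j0:j0+T] += (
                    A[i0:i0+T, k0:k0+T] @ B[k0:k0+T, j0:j0+T])
    return C

A, B = np.random.randn(64, 64), np.random.randn(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```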
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.