Related papers: Evaluating the Overhead of the Performance Profiler Cloudprofiler With MooBench

URL: http://arxiv.org/abs/2411.17413v1
Date: Tue, 26 Nov 2024 13:20:19 GMT
Title: Evaluating the Overhead of the Performance Profiler Cloudprofiler With MooBench
Authors: Shinhyung Yang, David Georg Reichelt, Wilhelm Hasselbring
Abstract summary: In this work, we measure the overhead of Cloudprofiler, a performance profiler implemented in C++ to measure native and JVM processes. It minimizes the profiling overhead by locating the profiler process outside the target process and moving the disk writing overhead off the critical path. Its buffered ID handler with ZSTD compression is 6.15 times faster than the non-buffered and non-compression handler.
Score: 0.2867517731896504
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Performance engineering has become crucial for the cloud-native architecture. This architecture deploys multiple services, with each service representing an orchestration of containerized processes. OpenTelemetry is growing popular in the cloud-native industry for observing the software's behaviour, and Kieker provides the necessary tools to monitor and analyze the performance of target architectures. Observability overhead is an important aspect of performance engineering and MooBench is designed to compare different observability frameworks, including OpenTelemetry and Kieker. In this work, we measure the overhead of Cloudprofiler, a performance profiler implemented in C++ to measure native and JVM processes. It minimizes the profiling overhead by locating the profiler process outside the target process and moving the disk writing overhead off the critical path with buffer blocks and compression threads. Using MooBench, Cloudprofiler's buffered ID handler with the Zstandard lossless data compression ZSTD showed an average execution time of 2.28 microseconds. It is 6.15 times faster than the non-buffered and non-compression handler.
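The buffering idea described in the abstract can be illustrated with a minimal sketch. This is not Cloudprofiler's code: the class `BufferedHandler`, the helper `read_records`, the 16-byte record layout, and the length-prefixed block framing are all hypothetical, a background thread stands in for the separate profiler process, and Python's standard-library `zlib` stands in for ZSTD. The sketch only shows how the hot path can be reduced to appending bytes to a buffer block while compression and writing happen elsewhere.

```python
import io
import queue
import threading
import zlib


class BufferedHandler:
    """Hypothetical handler: the caller only appends to an in-memory block;
    a worker thread compresses full blocks and writes them to the sink."""

    def __init__(self, sink, block_size=4096):
        self._sink = sink                # any object with write(bytes)
        self._block_size = block_size    # assumed block size in bytes
        self._buf = bytearray()
        self._blocks = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, record_id, timestamp_ns):
        # Critical path: encode 16 bytes and append; no compression, no I/O.
        self._buf += record_id.to_bytes(8, "little")
        self._buf += timestamp_ns.to_bytes(8, "little")
        if len(self._buf) >= self._block_size:
            self._blocks.put(bytes(self._buf))
            self._buf.clear()

    def close(self):
        # Flush the partial block, then signal the worker to stop.
        if self._buf:
            self._blocks.put(bytes(self._buf))
            self._buf.clear()
        self._blocks.put(None)
        self._worker.join()

    def _drain(self):
        # Off the critical path: compress each block and write it out
        # with a 4-byte length prefix (zlib here, not ZSTD).
        while True:
            block = self._blocks.get()
            if block is None:
                break
            compressed = zlib.compress(block)
            self._sink.write(len(compressed).to_bytes(4, "little") + compressed)


def read_records(data):
    """Decode the length-prefixed compressed blocks back into records."""
    records, pos = [], 0
    while pos < len(data):
        size = int.from_bytes(data[pos:pos + 4], "little")
        block = zlib.decompress(data[pos + 4:pos + 4 + size])
        pos += 4 + size
        for i in range(0, len(block), 16):
            records.append((int.from_bytes(block[i:i + 8], "little"),
                            int.from_bytes(block[i + 8:i + 16], "little")))
    return records


if __name__ == "__main__":
    sink = io.BytesIO()
    handler = BufferedHandler(sink, block_size=64)
    for i in range(10):
        handler.log(i, 1000 + i)
    handler.close()
    print(read_records(sink.getvalue()))
```

A single worker and a FIFO queue keep the blocks in order, so the decoded records match the order in which they were logged. A non-buffered handler would instead write on every `log` call, which is the kind of per-call cost the buffered design is meant to avoid.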

Related papers

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs [81.5049387116454]
We introduce APB, an efficient long-context inference framework. APB uses multi-host approximate attention to enhance prefill speed. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively.
arXiv Detail & Related papers (2025-02-17T17:59:56Z)
Tracezip: Efficient Distributed Tracing via Trace Compression [26.353398496686854]
Distributed tracing serves as a fundamental building block in the monitoring and testing of cloud service systems. Head-based sampling indiscriminately selects requests to trace when they enter the system, which may miss critical events. Tail-based sampling first captures all requests and then selectively persists the edge-case traces. We propose Tracezip to enhance the efficiency of distributed tracing via trace compression.
arXiv Detail & Related papers (2025-02-10T10:13:57Z)
SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose the first serverless workflow benchmarking suite SeBS-Flow. SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns. We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks [1.3398445165628463]
This paper provides a comprehensive analysis of fault recovery performance, stability, and recovery time in a cloud-native environment. Our results indicate that Flink is the most stable and has one of the best fault recovery performances. Kafka Streams shows suitable fault recovery performance and stability, but with higher event latency.
arXiv Detail & Related papers (2024-04-09T10:49:23Z)
ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks [1.4374467687356276]
This paper introduces ShuffleBench, a novel benchmark to evaluate the performance of modern stream processing frameworks. ShuffleBench is inspired by requirements for near real-time analytics of a large cloud observability platform. Our results show that Flink achieves the highest throughput while Hazelcast processes data streams with the lowest latency.
arXiv Detail & Related papers (2024-03-07T15:06:24Z)
Benchmarking scalability of stream processing frameworks deployed as microservices in the cloud [0.38073142980732994]
We benchmark five modern stream processing frameworks regarding their scalability using a systematic method. All benchmarked frameworks exhibit approximately linear scalability as long as sufficient cloud resources are provisioned. There is no clear superior framework, but the ranking of the frameworks depends on the use case.
arXiv Detail & Related papers (2023-03-20T13:22:03Z)
SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation. We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware. Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters. We also demonstrate that, unlike MAPLE, which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
Model-Architecture Co-Design for High Performance Temporal GNN Inference on FPGA [5.575293536755127]
Real-world applications require high performance inference on real-time streaming dynamic graphs. We present a novel model-architecture co-design for inference in memory-based TGNNs on FPGAs. We train our simplified models using knowledge distillation to ensure similar accuracy vis-à-vis the original model.
arXiv Detail & Related papers (2022-03-10T00:24:47Z)
Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations [14.432131909590824]
Reinforcement Learning (RL) has achieved significant success in application domains such as robotics, games, health care and others. Current implementations exhibit poor performance due to challenges such as irregular memory accesses and synchronization overheads. We propose a framework for generating scalable reinforcement learning implementations on multicore systems.
arXiv Detail & Related papers (2021-10-03T21:00:53Z)
ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception [91.24236600199542]
ASH is a modern and high-performance framework for parallel spatial hashing on GPU. ASH achieves higher performance, supports richer functionality, and requires fewer lines of code. ASH and its example applications are open sourced in Open3D.
arXiv Detail & Related papers (2021-10-01T16:25:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.