Related papers: GateKeeper-GPU: Fast and Accurate Pre-Alignment Filtering in Short Read Mapping

GateKeeper-GPU: Fast and Accurate Pre-Alignment Filtering in Short Read Mapping

URL: http://arxiv.org/abs/2103.14978v2
Date: Wed, 31 Mar 2021 08:55:06 GMT
Title: GateKeeper-GPU: Fast and Accurate Pre-Alignment Filtering in Short Read Mapping
Authors: Z\"ulal Bing\"ol, Mohammed Alser, Onur Mutlu, Ozcan Ozturk, Can Alkan
Abstract summary: GateKeeper-GPU is a fast and accurate pre-alignment filter for sequence alignment. It is exploited by the large number of GPU threads to examine numerous sequence pairs rapidly and concurrently. GateKeeper-GPU accelerates the sequence alignment by up to 2.9x and provides up to 1.4x speedup to the end-to-end execution time of a comprehensive read mapper.
Score: 7.680154692488026
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: At the last step of short read mapping, the candidate locations of the reads on the reference genome are verified to compute their differences from the corresponding reference segments using sequence alignment algorithms. Calculating the similarities and differences between two sequences is still computationally expensive since approximate string matching techniques traditionally inherit dynamic programming algorithms with quadratic time and space complexity. We introduce GateKeeper-GPU, a fast and accurate pre-alignment filter that efficiently reduces the need for expensive sequence alignment. GateKeeper-GPU provides two main contributions: first, improving the filtering accuracy of GateKeeper(state-of-the-art lightweight pre-alignment filter), second, exploiting the massive parallelism provided by the large number of GPU threads of modern GPUs to examine numerous sequence pairs rapidly and concurrently. GateKeeper-GPU accelerates the sequence alignment by up to 2.9x and provides up to 1.4x speedup to the end-to-end execution time of a comprehensive read mapper (mrFAST). GateKeeper-GPU is available at https://github.com/BilkentCompGen/GateKeeper-GPU

Related papers

GPUTOK: GPU Accelerated Byte Level BPE Tokenization [0.0]
We build a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules.<n>It includes a basic BlockBPE-style kernel and a faster, optimized version that uses cuCollections static map, CUB reductions, and a pybind11 interface for Python.<n>On WikiText103 sequences up to 131k tokens, the optimized tokenizer produces the same longest inputs, is about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer.
arXiv Detail & Related papers (2026-03-03T04:48:28Z)
GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions [54.570944939061555]
We present a comprehensive study of GPU-accelerated graph-based vector search algorithms.<n>We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units.<n>Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems.
arXiv Detail & Related papers (2026-02-10T16:18:04Z)
Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference.<n>Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z)
GPU-Accelerated Interpretable Generalization for Rapid Cyberattack Detection and Forensics [0.0]
IG mechanism recently published in IEEE Transactions on Information Forensics and Security delivers state-of-the-art, evidence-based intrusion detection.<n>We present IG-GPU, a PyTorch re-architecture that offloads all pairwise intersections and subset evaluations to commodity GPU.<n>In 15k-record NSL-KDD dataset, IG-GPU shows a 116-fold speed-up over the multi-core CPU implementation of IG.
arXiv Detail & Related papers (2025-07-16T12:38:19Z)
Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos.<n>We propose a novel distributed inference strategy, termed DualParal.<n>Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
Ramp Up NTT in Record Time using GPU-Accelerated Algorithms and LLM-based Code Generation [11.120838175165986]
Homomorphic encryption (HE) is a core building block in privacy-preserving machine learning (PPML) Many GPU-accelerated cryptographic schemes have been proposed to improve the performance of HE. Given the powerful code generation capabilities of large language models (LLMs), we aim to explore their potential to automatically generate practical GPU-friendly algorithm code.
arXiv Detail & Related papers (2025-02-16T12:53:23Z)
Implementation and Analysis of GPU Algorithms for Vecchia Approximation [0.8057006406834466]
Vecchia Approximation is widely used to reduce the computational complexity and can be calculated with embarrassingly parallel algorithms. While multi-core software has been developed for Vecchia Approximation, software designed to run on graphics processing units ( GPU) is lacking. We show that our new method outperforms the other two and then present it in the GpGpU R package.
arXiv Detail & Related papers (2024-07-03T01:24:44Z)
Minuet: Accelerating 3D Sparse Convolutions on GPUs [9.54287796030519]
Sparse Convolution (SC) is widely used for processing 3D point clouds that are inherently sparse. In this work, we analyze the shortcomings of prior state-of-the-art SC engines, and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs. Our evaluations show that Minuet significantly outperforms prior SC engines by on average $1.74times$ (up to $2.22times$) for end-to-end point cloud network executions.
arXiv Detail & Related papers (2023-12-01T05:09:02Z)
High Performance Computing Applied to Logistic Regression: A CPU and GPU Implementation Comparison [0.0]
We present a versatile GPU-based parallel version of Logistic Regression (LR) Our implementation is a direct translation of the parallel Gradient Descent Logistic Regression algorithm proposed by X. Zou et al. Our method is particularly advantageous for real-time prediction applications like image recognition, spam detection, and fraud detection.
arXiv Detail & Related papers (2023-08-19T14:49:37Z)
PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning. However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware. PLSSVM can be used as a drop-in replacement for an LVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy. We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization [5.02836935036198]
We propose RTGPU, which can schedule the execution of multiple GPU applications in real-time to meet hard deadlines. Our approach provides superior schedulability compared with previous work, and gives real-time guarantees to meet hard deadlines for multiple GPU applications.
arXiv Detail & Related papers (2021-01-25T22:34:06Z)
GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON. We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z)
Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems. Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections. Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle. Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++ that significantly outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
Parallelising the Queries in Bucket Brigade Quantum RAM [69.43216268165402]
Quantum algorithms often use quantum RAMs (QRAM) for accessing information stored in a database-like manner. We show a systematic method to significantly reduce the effective query time by using Clifford+T gate parallelism. We conclude that, in theory, fault-tolerant bucket brigade quantum RAM queries can be performed approximately with the speed of classical RAM.
arXiv Detail & Related papers (2020-02-21T14:50:03Z)
Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach. We propose an Efficient Video(EVS) pipeline that combines: (i) On the CPU, a very fast optical flow method, that is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.