FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities
- URL: http://arxiv.org/abs/2508.07315v2
- Date: Wed, 13 Aug 2025 15:09:16 GMT
- Title: FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities
- Authors: Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg,
- Abstract summary: We present a novel open-source FlexCTC toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting.
- Score: 16.660841429852333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present a novel open-source FlexCTC toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation with eliminated CPU-GPU synchronization and minimized kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making it suitable for both research and production use.
Related papers
- StitchCUDA: An Automated Multi-Agents End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning [26.264303471292845]
We propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation. Experiments show that StitchCUDA achieves a nearly 100% success rate on end-to-end programming tasks, with a 1.72x better speedup than the multi-agent baseline.
arXiv Detail & Related papers (2026-03-03T06:04:49Z) - CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. It delivers 100%, 100%, and 92% faster rates over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z) - FIDESlib: A Fully-Fledged Open-Source FHE Library for Efficient CKKS on GPUs [0.7146800600221728]
We introduce FIDESlib, the first open-source server-side CKKS GPU library. For bootstrapping, FIDESlib achieves no less than a 70x speedup over the AVX-optimized OpenFHE implementation.
arXiv Detail & Related papers (2025-07-07T08:51:14Z) - HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration [13.53425131505526]
Deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, and this ecosystem has established a dominant position in the field of parallel software. However, translating CUDA code to other platforms poses significant challenges due to differences in parallel programming paradigms and hardware.
arXiv Detail & Related papers (2025-06-12T06:48:33Z) - NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding the significant slowdown caused by beam search.
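Shallow fusion of an n-gram LM into greedy decoding, which NGPU-LM reworks into a GPU-friendly form, amounts to adding a weighted LM log-probability to each acoustic score before taking the argmax. The toy CPU sketch below uses a hypothetical bigram table and fusion weight to show the scoring rule; it is not the NGPU-LM data structure, which uses flat arrays suited to parallel lookup.

```python
import math

# Toy shallow-fusion sketch: score(v) = log p_am(v) + alpha * log p_lm(v | prev).
# The bigram table and alpha below are hypothetical illustrations, not the
# NGPU-LM data structure (which stores n-grams in GPU-friendly flat arrays).

BIGRAM = {  # p_lm(next | prev) for a tiny toy vocabulary
    ("<s>", "the"): 0.6, ("<s>", "a"): 0.4,
    ("the", "cat"): 0.7, ("the", "a"): 0.3,
    ("a", "cat"): 0.9, ("a", "the"): 0.1,
}

def fused_step(am_logprobs, prev_word, alpha=0.5, floor=1e-4):
    """Pick the word maximizing acoustic + weighted LM log-probability."""
    best, best_score = None, -math.inf
    for word, am in am_logprobs.items():
        lm = math.log(BIGRAM.get((prev_word, word), floor))  # unseen -> floor
        score = am + alpha * lm
        if score > best_score:
            best, best_score = word, score
    return best
```

When the acoustic model is uncertain (near-equal scores), the LM term breaks the tie toward likelier continuations; when the acoustic evidence is strong, it dominates. A GPU implementation evaluates this fused score for the whole vocabulary and batch in parallel.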
arXiv Detail & Related papers (2025-05-28T20:43:10Z) - Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
arXiv Detail & Related papers (2025-05-27T11:55:22Z) - Scaling Tractable Probabilistic Circuits: A Systems Perspective [53.76194929291088]
PyJuice is a general implementation design for PCs that improves prior art in several regards. It is 1-2 orders of magnitude faster than existing systems at training large-scale PCs. PyJuice consumes 2-5x less memory, which enables us to train larger models.
arXiv Detail & Related papers (2024-06-02T14:57:00Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49x speedup and maintains a minimal memory overhead of just 0.0004%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition [1.2680687621338012]
Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines.
We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam decoder compatible with current CTC models.
It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition.
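Word or phrase boosting of this kind is typically pictured as granting a score bonus whenever a hypothesis extends a prefix of a boosted phrase, with the match state tracked in a trie. The standalone sketch below shows that idea in miniature; all names are hypothetical, and it is not the WFST decoder's on-the-fly composition (which fuses a boosting transducer into the search graph itself).

```python
# Minimal sketch of phrase boosting: award a score bonus for each token that
# extends a prefix of a boosted phrase, tracked with a nested-dict trie.
# Hypothetical names; real decoders track trie state per beam hypothesis.

def build_trie(phrases):
    """Build a nested-dict trie over the token sequences to boost."""
    root = {}
    for phrase in phrases:
        node = root
        for tok in phrase:
            node = node.setdefault(tok, {})
    return root

def boost_bonus(tokens, trie, bonus=2.0):
    """Total bonus collected by `tokens` for matching boosted-phrase prefixes."""
    total, node = 0.0, trie
    for tok in tokens:
        if tok in node:          # still extending the current phrase match
            node = node[tok]
            total += bonus
        elif tok in trie:        # match broke, but token starts a new phrase
            node = trie[tok]
            total += bonus
        else:                    # no match; reset to the trie root
            node = trie
    return total
```

A production decoder also revokes partial bonuses when a phrase match ultimately fails, so that half-matched phrases do not distort the beam; that bookkeeping is omitted here for brevity.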
arXiv Detail & Related papers (2023-11-08T19:57:10Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Project CGX: Scalable Deep Learning on Commodity GPUs [17.116792714097738]
This paper investigates whether hardware overprovisioning can be supplanted via algorithmic and system design.
We propose a framework called CGX, which provides efficient software support for communication compression.
We show that this framework is able to remove communication bottlenecks from consumer-grade multi-GPU systems.
arXiv Detail & Related papers (2021-11-16T17:00:42Z) - Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
This paper tackles the problem of real-time semantic segmentation of high definition videos using a hybrid GPU / CPU approach.
We propose an Efficient Video Segmentation (EVS) pipeline that combines, on the CPU, a very fast optical flow method used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next.
On the popular Cityscapes dataset with high resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
arXiv Detail & Related papers (2019-12-26T11:45:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.