Related papers: FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms

URL: http://arxiv.org/abs/2510.09085v1
Date: Fri, 10 Oct 2025 07:32:54 GMT
Title: FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
Authors: Atul Shree, Harshith Jupuru
Abstract summary: CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC). FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation.
Score: 1.518298096221251
License: http://creativecommons.org/licenses/by/4.0/
Abstract: CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.
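The abstract describes pruning low-probability tokens in each frame against a relative threshold, but it does not spell out the exact rule. The following is a minimal sketch under one plausible reading: a token survives in a frame only if its probability is at least `rel_threshold` times that frame's highest token probability. The function names `prune_frame` and `prune_utterance`, the parameter `rel_threshold`, and the toy posteriors are all hypothetical and are not taken from the paper.

```python
import math


def prune_frame(log_probs, rel_threshold):
    """Return indices of tokens kept in one frame.

    Assumed rule (not confirmed by the paper): keep token i when
    p_i >= rel_threshold * max_j p_j, evaluated in the log domain.
    """
    if not 0.0 < rel_threshold <= 1.0:
        raise ValueError("rel_threshold must be in (0, 1]")
    cutoff = max(log_probs) + math.log(rel_threshold)
    return [i for i, lp in enumerate(log_probs) if lp >= cutoff]


def prune_utterance(frames, rel_threshold=0.01):
    """Apply the per-frame rule independently to every frame."""
    return [prune_frame(frame, rel_threshold) for frame in frames]


# Toy posteriors over a 4-token vocabulary (illustrative values only).
frames = [
    [math.log(p) for p in (0.90, 0.07, 0.02, 0.01)],  # peaked frame
    [math.log(p) for p in (0.40, 0.35, 0.20, 0.05)],  # flat frame
]
kept = prune_utterance(frames, rel_threshold=0.1)
print(kept)  # [[0], [0, 1, 2, 3]]
```

Under this assumption a peaked frame hands a single candidate to the decoder while a flat frame keeps every token, so the number of tokens a downstream beam search has to expand adapts frame by frame instead of being fixed in advance.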

Related papers

You Need an Encoder for Native Position-Independent Caching [28.778240400537175]
Key-Value cache of Large Language Models (LLMs) is prefix-based. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints. We propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks.
arXiv Detail & Related papers (2026-02-02T01:23:13Z)
Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
arXiv Detail & Related papers (2025-12-24T18:59:05Z)
LATTE: A Decoding Architecture for Quantum Computing with Temporal and Spatial Scalability [7.184133388805955]
We introduce a FPGA- hybrid decoding architecture, LATTE, to address the key requirements of scaling up in lattice surgery quantum overhead. LATTE delivers accuracy on par with the base decoder while achieving real-time decoding throughput and significantly reducing both bandwidth requirements and computational resources.
arXiv Detail & Related papers (2025-09-04T07:29:21Z)
Silentflow: Leveraging Trusted Execution for Resource-Limited MPC via Hardware-Algorithm Co-design [6.998260344481881]
We introduce Silentflow, a protocol that eliminates communication in COT generation. We balance end-to-end latency and resource demands, achieving up to 39.51x speedup over state-of-the-art protocols.
arXiv Detail & Related papers (2025-08-18T21:00:10Z)
Towards Practical Real-Time Neural Video Compression [60.390180067626396]
We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. Experiments show our proposed DCVC-RT achieves an impressive average encoding/decoding speed of 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM.
arXiv Detail & Related papers (2025-02-28T06:32:23Z)
Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders. We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
FDC: Fast KV Dimensionality Compression for Efficient LLM Inference [11.194752361478567]
FDC is a fast KV dimensionality compression system that eliminates the decompression overhead incurred in the existing KV dimensionality compression system, Palu, and reduces attention time. In experiments, FDC can reduce Job Completion Time (JCT) by up to 64%, and delivers up to 1.97X throughput under the same latency. When state-of-the-art eviction and quantization methods are combined with FDC, they exhibit similar improvements compared to those combined with Palu.
arXiv Detail & Related papers (2024-08-07T22:10:26Z)
GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition [1.2680687621338012]
Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines. We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam decoder compatible with current CTC models. It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition.
arXiv Detail & Related papers (2023-11-08T19:57:10Z)
Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR [67.63332492134332]
We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs. Our proposed encoder can double as a strong standalone encoder on device, and as the first part of a high-performance ASR pipeline.
arXiv Detail & Related papers (2023-03-31T23:30:48Z)
CTC Variations Through New WFST Topologies [79.94035631317395]
This paper presents novel Weighted Finite-State Transducer (WFST) topologies to implement Connectionist Temporal Classification (CTC)-like algorithms for automatic speech recognition. Three new CTC variants are proposed: (1) the "compact-CTC", in which direct transitions between units are replaced with <epsilon> back-off transitions; (2) the "minimal-CTC", that only adds <blank> self-loops when used in WFST-composition; and (3) "selfless-CTC", that disallows self-loop for non-blank units.
arXiv Detail & Related papers (2021-10-06T23:00:15Z)
Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. The proposed method significantly reduces recognition errors and emission latency simultaneously. The best MoChA system shows performance comparable to that of RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model [164.7489982837475]
This paper proposes a Recurrent Learned Video Compression (RLVC) approach with the Recurrent Auto-Encoder (RAE) and Recurrent Probability Model (RPM). The RAE employs recurrent cells in both the encoder and decoder to exploit the temporal correlation among video frames. Our approach achieves the state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM.
arXiv Detail & Related papers (2020-06-24T08:46:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.