Related papers: TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

URL: http://arxiv.org/abs/2510.05485v1
Date: Tue, 07 Oct 2025 01:02:46 GMT
Title: TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation
Authors: Adam Filipek,
Abstract summary: In this paper, we introduce a novel implementation of the BLEU metric designed from the ground up for this specific use case.<n>Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch.<n>By creating a compact, batch-specific dictionary of n-grams using texttttorch.unique, our method avoids the prohibitive memory costs of traditional hashing-based vectorization.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using \texttt{torch.unique}, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and exceeding 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a "Token-ID BLEU" for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.

Related papers

Block Sparse Flash Attention [29.499030734003952]
Block-Sparse FlashAttention is a drop-in replacement for FlashAttention.<n>It computes exact query-key similarities to select the top-k most important value blocks for each query.<n>It achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x needle-in-a-haystack retrieval tasks.
arXiv Detail & Related papers (2025-12-07T21:20:12Z)
Elementary, My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset [1.497166779417398]
Keywords Spotting (KWS) is a privacy-aware intermediate task for brain-computer interfaces.<n>We release an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials.
arXiv Detail & Related papers (2025-10-23T22:44:50Z)
Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression [57.54335545892155]
We introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook.<n>Our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines.
arXiv Detail & Related papers (2025-10-23T20:19:48Z)
Dynamic Sparse Attention on Mobile SoCs [11.250584640139998]
This paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU.<n>The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute.
arXiv Detail & Related papers (2025-08-22T07:41:35Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference.<n>Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead.<n>The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
Cut Your Losses in Large-Vocabulary Language Models [102.6981011879656]
We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory.<n>CCE reduces the memory footprint of the loss from 24 GB to 1 MB, and the total training-time memory consumption of the head from 28 GB to 1 GB.
arXiv Detail & Related papers (2024-11-13T20:30:15Z)
Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees [11.732842929815401]
Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) data. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost.
arXiv Detail & Related papers (2023-09-18T17:49:09Z)
High Performance Computing Applied to Logistic Regression: A CPU and GPU Implementation Comparison [0.0]
We present a versatile GPU-based parallel version of Logistic Regression (LR) Our implementation is a direct translation of the parallel Gradient Descent Logistic Regression algorithm proposed by X. Zou et al. Our method is particularly advantageous for real-time prediction applications like image recognition, spam detection, and fraud detection.
arXiv Detail & Related papers (2023-08-19T14:49:37Z)
Efficient CNN with uncorrelated Bag of Features pooling [98.78384185493624]
Bag of Features (BoF) has been recently proposed to reduce the complexity of convolution layers. We propose an approach that builds on top of BoF pooling to boost its efficiency by ensuring that the items of the learned dictionary are non-redundant. The proposed strategy yields an efficient variant of BoF and further boosts its performance, without any additional parameters.
arXiv Detail & Related papers (2022-09-22T09:00:30Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages. We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures. Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging. We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization than on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)
On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points. We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.