Related papers: A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems

URL: http://arxiv.org/abs/2509.24478v1
Date: Mon, 29 Sep 2025 08:53:02 GMT
Title: A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems
Authors: Lasse Borgholt, Jakob Havtorn, Christian Igel, Lars Maaløe, Zheng-Hua Tan
Abstract summary: Modern neural networks have greatly improved performance across speech recognition benchmarks. Errors in rare terms, named entities, and domain-specific vocabulary are more consequential, but remain hidden by aggregate metrics. We propose a novel alignment algorithm that couples dynamic programming with beam search scoring.
Score: 23.218327444488164
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern neural networks have greatly improved performance across speech recognition benchmarks. However, gains are often driven by frequent words with limited semantic weight, which can obscure meaningful differences in word error rate, the primary evaluation metric. Errors in rare terms, named entities, and domain-specific vocabulary are more consequential, but remain hidden by aggregate metrics. This highlights the need for finer-grained error analysis, which depends on accurate alignment between reference and model transcripts. However, conventional alignment methods are not designed for such precision. We propose a novel alignment algorithm that couples dynamic programming with beam search scoring. Compared to traditional text alignment methods, our approach provides more accurate alignment of individual errors, enabling reliable error analysis. The algorithm is made available via PyPI.
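For orientation, the kind of conventional alignment the abstract contrasts itself with can be sketched as a word-level Levenshtein alignment with a backtrace. This is a minimal illustration, not the paper's algorithm: the beam search scoring it proposes is not reproduced here, and the function names `align_words` and `word_error_rate` are hypothetical, chosen only for this example.

```python
from typing import List, Optional, Tuple

# (label, reference word or None, hypothesis word or None)
Op = Tuple[str, Optional[str], Optional[str]]


def align_words(ref: List[str], hyp: List[str]) -> List[Op]:
    """Align two word sequences with unit-cost Levenshtein dynamic programming.

    Illustrative baseline only; NOT the algorithm proposed in the paper.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right cell to recover one optimal alignment.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            label = "match" if ref[i - 1] == hyp[j - 1] else "substitute"
            ops.append((label, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def word_error_rate(ops: List[Op]) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref_len = sum(1 for _, r, _ in ops if r is not None)
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    errors = sum(1 for label, _, _ in ops if label != "match")
    return errors / ref_len


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sit on mat".split()
    for op in align_words(reference, hypothesis):
        print(op)
    print(word_error_rate(align_words(reference, hypothesis)))
```

Several alignments can share the same minimum cost, and the backtrace above simply returns the first one it finds (preferring the diagonal, then deletions). That arbitrariness leaves the aggregate word error rate unchanged but can attribute individual errors to the wrong words, which is the limitation the abstract points to.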

Related papers

LRANet++: Low-Rank Approximation Network for Accurate and Efficient Text Spotting [118.93173826110815]
We propose a novel parameterized text shape method based on low-rank approximation for precise detection. By exploiting the inherent shape correlation among different text contours, our method achieves consistency and compactness in shape representation. We integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++.
arXiv Detail & Related papers (2025-11-08T03:08:03Z)
Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition [56.972851337263755]
We propose a method which allows corrections of substitution errors to improve the recognition accuracy of challenging words. We show that with this method we get a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
arXiv Detail & Related papers (2025-06-23T14:42:03Z)
Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications [5.266869303483375]
The Word Error Rate (WER) is the common measure of accuracy for Automatic Speech Recognition (ASR). We present a non-destructive, token-based approach using an extended Levenshtein distance algorithm to compute a robust WER. We also provide an exemplary analysis of derived use cases, such as a punctuation error rate, and a web application for interactive use and visualisation of our implementation.
arXiv Detail & Related papers (2024-08-28T08:14:51Z)
Self-consistent context aware conformer transducer for speech recognition [0.06008132390640294]
We introduce a novel neural network module that adeptly handles recursive data flow in neural network architectures. Our method notably improves the accuracy of recognizing rare words without adversely affecting the word error rate for common vocabulary. Our findings reveal that the combination of both approaches can improve the accuracy of detecting rare words by as much as 4.5 times.
arXiv Detail & Related papers (2024-02-09T18:12:11Z)
Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors. We propose to discover those patterns of tokens that distinguish correct and erroneous predictions. We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
ED-CEC: Improving Rare Word Recognition Using ASR Postprocessing Based on Error Detection and Context-Aware Error Correction [30.486396813844195]
We present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection and context-aware error correction. Experimental results across five datasets demonstrate that our proposed method achieves significantly lower word error rates (WERs) than previous approaches.
arXiv Detail & Related papers (2023-10-08T11:40:30Z)
Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version) [21.325463387256807]
Two new metrics are proposed, Text-based Diarization Error Rate and Diarization F1, which perform utterance- and word-level evaluations. Our metrics encompass more types of errors compared to existing ones, allowing us to make a more comprehensive analysis in speaker diarization.
arXiv Detail & Related papers (2023-09-14T12:43:26Z)
End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document. Standard metrics do not take into account the inconsistencies that might appear. We propose a two-fold evaluation, where the transcription accuracy and the reading order (RO) goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z)
Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels. Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation. This paper aims to address the issue with a mask-and-predict strategy. We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions. Experiments on Semantic Textual Similarity show the proposed metric, Neighboring Distribution Divergence (NDD), to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
Automatic Vocabulary and Graph Verification for Accurate Loop Closure Detection [21.862978912891677]
Bag-of-Words (BoW) builds a visual vocabulary to associate features and then detect loops. We propose a natural convergence criterion based on the comparison between the radii of nodes and the drifts of feature descriptors. We present a novel topological graph verification method for validating candidate loops.
arXiv Detail & Related papers (2021-07-30T13:19:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.