asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation
- URL: http://arxiv.org/abs/2601.20992v1
- Date: Wed, 28 Jan 2026 19:43:34 GMT
- Title: asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation
- Authors: Oleg Sedukhin, Andrey Kostin
- Abstract summary: We propose a string alignment algorithm that supports both multi-reference labeling and arbitrary-length insertions. This is especially useful for non-Latin languages and languages with rich word formation, and for labeling cluttered or longform speech. We develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually.
- Score: 0.5729426778193398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for non-Latin languages and languages with rich word formation, and for labeling cluttered or longform speech. Secondly, we collect a novel test set, DiverseSpeech-Ru, of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, creating an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions for visual comparison. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.
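The core idea behind multi-reference alignment can be illustrated with a small sketch. The snippet below is a hypothetical, simplified illustration under assumed names (`RefSlot` and `align_wer` are invented, not the asr_eval API): each reference slot carries a set of acceptable word forms, and slots marked optional model fillers or arbitrary-length insertion points that may be dropped without penalty.

```python
# Minimal sketch of multi-reference word alignment for WER-style scoring.
# Hypothetical data model (not the asr_eval API): each reference slot is a set
# of acceptable word forms; an optional slot models speech that may be
# legitimately dropped (fillers, arbitrary-length insertion points).

from dataclasses import dataclass

@dataclass
class RefSlot:
    forms: set             # acceptable spellings / word forms for this slot
    optional: bool = False # may be skipped without counting as a deletion

def align_wer(reference: list, hypothesis: list) -> float:
    """WER-like score where a hypothesis word matches a slot if it is in
    the slot's form set; optional slots can be deleted for free."""
    R, H = len(reference), len(hypothesis)
    INF = float("inf")
    # d[i][j] = min edits aligning first i reference slots with first j hyp words
    d = [[INF] * (H + 1) for _ in range(R + 1)]
    d[0][0] = 0
    for j in range(1, H + 1):
        d[0][j] = j  # pure insertions
    for i in range(1, R + 1):
        skip_cost = 0 if reference[i - 1].optional else 1
        d[i][0] = d[i - 1][0] + skip_cost  # deletions (free if optional)
    for i in range(1, R + 1):
        slot = reference[i - 1]
        skip_cost = 0 if slot.optional else 1
        for j in range(1, H + 1):
            sub_cost = 0 if hypothesis[j - 1] in slot.forms else 1
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost,  # match / substitution
                d[i - 1][j] + skip_cost,     # deletion of a reference slot
                d[i][j - 1] + 1,             # insertion of a hypothesis word
            )
    # Normalize by mandatory slots only
    n_ref = sum(1 for s in reference if not s.optional) or 1
    return d[R][H] / n_ref

if __name__ == "__main__":
    ref = [RefSlot({"ну", "nu"}, optional=True),  # filler: free to drop
           RefSlot({"привет"}),
           RefSlot({"мир", "миру"})]              # accepted inflected variants
    print(align_wer(ref, ["привет", "мир"]))      # 0.0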
Related papers
- CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions [0.5120567378386615]
We fine-tune Whisper to produce more verbatim speech transcriptions.
We employ several techniques to increase robustness against multiple speakers and background noise.
arXiv Detail & Related papers (2024-08-29T14:52:42Z) - Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z) - DASB -- Discrete Audio and Speech Benchmark [12.02056212008393]
We release the Discrete Audio and Speech Benchmark (DASB), a leaderboard for benchmarking discrete audio tokens across a range of tasks.
Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks.
However, the performance gap between semantic tokens and standard continuous representations remains substantial.
arXiv Detail & Related papers (2024-06-20T13:23:27Z) - Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats [1.087459729391301]
We propose a new approach that can create binary feature vectors dynamically for all sounds that can be represented in the standardized version of the International Phonetic Alphabet proposed by the Cross-Linguistic Transcription Systems (CLTS) reference catalog.
Our system not only provides a straightforward means to compare the similarity of speech sounds, but also shows potential for future cross-linguistic machine learning applications.
arXiv Detail & Related papers (2024-05-07T12:40:59Z) - Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition [19.763431520942028]
We develop a benchmark data set of code-switching speech recognition hypotheses with human judgments.
We define clear guidelines for minimal editing of automatic hypotheses.
We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversation speech.
arXiv Detail & Related papers (2022-11-22T08:14:07Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend jointly trained acoustic and written word embeddings to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z) - Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z) - Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.