SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech
Recognition Evaluation
- URL: http://arxiv.org/abs/2403.08196v1
- Date: Wed, 13 Mar 2024 02:41:53 GMT
- Title: SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech
Recognition Evaluation
- Authors: Jiayu Du, Jinpeng Li, Guoguo Chen, and Wei-Qiang Zhang
- Abstract summary: SpeechColab Leaderboard is a general-purpose, open-source platform designed for ASR evaluation.
We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems.
We quantify how distinct nuances in the scoring pipeline influence the final benchmark outcomes.
- Score: 7.640323749917747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the wake of the surging tide of deep learning over the past decade,
Automatic Speech Recognition (ASR) has garnered substantial attention, leading
to the emergence of numerous publicly accessible ASR systems that are actively
being integrated into our daily lives. Nonetheless, the impartial and
replicable evaluation of these ASR systems encounters challenges due to various
crucial subtleties. In this paper we introduce the SpeechColab Leaderboard, a
general-purpose, open-source platform designed for ASR evaluation. With this
platform: (i) We report a comprehensive benchmark, unveiling the current
state-of-the-art panorama for ASR systems, covering both open-source models and
industrial commercial services. (ii) We quantify how distinct nuances in the
scoring pipeline influence the final benchmark outcomes. These include nuances
related to capitalization, punctuation, interjections, contractions, synonym
usage, compound words, etc. These issues have gained prominence in the context
of the transition towards an End-to-End future. (iii) We propose a practical
modification to the conventional Token-Error-Rate (TER) evaluation metric, with
inspirations from Kolmogorov complexity and Normalized Information Distance
(NID). This adaptation, called modified-TER (mTER), achieves proper
normalization and symmetrical treatment of reference and hypothesis. By
leveraging this platform as a large-scale testing ground, this study
demonstrates the robustness and backward compatibility of mTER when compared to
TER. The SpeechColab Leaderboard is accessible at
https://github.com/SpeechColab/Leaderboard
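The scoring nuances in (ii) are typically handled by a text-normalization stage that runs before alignment and scoring. As an illustration only (the leaderboard's actual pipeline may differ, and the mapping tables below are hypothetical stand-ins for curated resources), a minimal sketch of such a normalizer in Python might look like:

```python
import re

# Hypothetical mapping tables for illustration; a real pipeline such as the
# SpeechColab Leaderboard's would use far larger, curated resources.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
SYNONYMS = {"ok": "okay", "mr": "mister"}
INTERJECTIONS = {"uh", "um", "hmm", "mm"}

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, expand contractions, drop interjections,
    and map synonyms so that surface-form differences do not inflate TER."""
    text = text.lower()
    # Expand contractions before punctuation stripping removes apostrophes.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove punctuation (keep word-internal hyphens for compound words).
    text = re.sub(r"[^\w\s-]", " ", text)
    tokens = []
    for tok in text.split():
        if tok in INTERJECTIONS:
            continue  # interjections are not scored
        tokens.append(SYNONYMS.get(tok, tok))
    return tokens

print(normalize("Uh, it's OK, Mr. Smith!"))
# -> ['it', 'is', 'okay', 'mister', 'smith']
```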
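For (iii), the abstract does not spell out the mTER formula, so the paper should be consulted for the exact definition. One NID-style reading consistent with the stated properties (scores bounded to [0, 1] and symmetric in reference and hypothesis) is to divide the token edit distance by the length of the longer sequence rather than by the reference length alone. A sketch under that assumption:

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Token-level Levenshtein distance via single-row dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def ter(ref: list[str], hyp: list[str]) -> float:
    """Conventional TER: edit distance over reference length.
    Asymmetric, and can exceed 1.0 when the hypothesis is much longer."""
    return edit_distance(ref, hyp) / len(ref)

def mter(ref: list[str], hyp: list[str]) -> float:
    """Assumed NID-style variant: normalizing by max(|ref|, |hyp|) bounds the
    score to [0, 1] and makes it symmetric in its two arguments. This
    illustrates the idea; it is not necessarily the paper's exact definition."""
    if not ref and not hyp:
        return 0.0
    return edit_distance(ref, hyp) / max(len(ref), len(hyp))

ref = "the quick brown fox".split()
hyp = "the quick brown fox jumps over".split()
print(ter(ref, hyp), mter(ref, hyp))  # 0.5 vs 0.333...
```

Note how the two insertions in the hypothesis inflate conventional TER to 0.5, while the symmetric normalization caps the penalty relative to the longer sequence.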
Related papers
- PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems [0.0]
This paper introduces the Persian Speech Recognition Benchmark (PSRB), a comprehensive benchmark designed to address this gap by incorporating diverse linguistic and acoustic conditions.
We evaluate ten ASR systems, including state-of-the-art commercial and open-source models, to examine performance variations and inherent biases.
Our findings indicate that while ASR models generally perform well on standard Persian, they struggle with regional accents, children's speech, and specific linguistic challenges.
arXiv Detail & Related papers (2025-05-27T14:14:55Z) - ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems [3.8947802481286478]
We introduce the ASR-FAIRBENCH leaderboard, which is designed to assess both the accuracy and equity of ASR models in real time.
Our approach reveals significant performance disparities in SOTA ASR models across demographic groups and offers a benchmark to drive the development of more inclusive ASR technologies.
arXiv Detail & Related papers (2025-05-16T11:31:31Z) - Socio-Emotional Response Generation: A Human Evaluation Protocol for LLM-Based Conversational Systems [9.101091541480434]
We propose a neural architecture that includes an intermediate step in planning socio-emotional strategies before response generation.
Our study shows that predicting a sequence of expected strategy labels and using this sequence to generate a response yields better results than a direct end-to-end generation scheme.
arXiv Detail & Related papers (2024-11-26T08:15:36Z) - Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish [0.0]
Speech datasets available in the public domain are often underutilized because of challenges in discoverability and interoperability.
A comprehensive framework has been designed to survey, catalog, and curate available speech datasets.
This research constitutes the most extensive comparison to date of both commercial and free ASR systems for the Polish language.
arXiv Detail & Related papers (2024-07-18T21:32:12Z) - Automatic Speech Recognition System-Independent Word Error Rate Estimation [23.25173244408922]
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems.
In this paper, a hypothesis generation method for ASR System-Independent WER estimation is proposed.
arXiv Detail & Related papers (2024-04-25T16:57:05Z) - End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.
This includes the segmentation of the audio as well as the run-time of the different components.
We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", that discounts the basic recall scores to alleviate the score inflation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - End-to-end contextual asr based on posterior distribution adaptation for
hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-the-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z) - On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion
Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z) - WER we are and WER we think we are [11.819335591315316]
We express skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets.
We compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and the HUB'05 public benchmark.
We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.
arXiv Detail & Related papers (2020-10-07T14:20:31Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)