SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech
Recognition Evaluation
- URL: http://arxiv.org/abs/2403.08196v1
- Date: Wed, 13 Mar 2024 02:41:53 GMT
- Title: SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech
Recognition Evaluation
- Authors: Jiayu Du, Jinpeng Li, Guoguo Chen, and Wei-Qiang Zhang
- Abstract summary: SpeechColab Leaderboard is a general-purpose, open-source platform designed for ASR evaluation.
We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems.
We quantify how distinct nuances in the scoring pipeline influence the final benchmark outcomes.
- Score: 7.640323749917747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the wake of the surging tide of deep learning over the past decade,
Automatic Speech Recognition (ASR) has garnered substantial attention, leading
to the emergence of numerous publicly accessible ASR systems that are actively
being integrated into our daily lives. Nonetheless, the impartial and
replicable evaluation of these ASR systems encounters challenges due to various
crucial subtleties. In this paper we introduce the SpeechColab Leaderboard, a
general-purpose, open-source platform designed for ASR evaluation. With this
platform: (i) We report a comprehensive benchmark, unveiling the current
state-of-the-art panorama for ASR systems, covering both open-source models and
industrial commercial services. (ii) We quantify how distinct nuances in the
scoring pipeline influence the final benchmark outcomes. These include nuances
related to capitalization, punctuation, interjection, contraction, synonym
usage, compound words, etc. These issues have gained prominence in the context
of the transition towards an End-to-End future. (iii) We propose a practical
modification to the conventional Token-Error-Rate (TER) evaluation metric, with
inspirations from Kolmogorov complexity and Normalized Information Distance
(NID). This adaptation, called modified-TER (mTER), achieves proper
normalization and symmetrical treatment of reference and hypothesis. By
leveraging this platform as a large-scale testing ground, this study
demonstrates the robustness and backward compatibility of mTER when compared to
TER. The SpeechColab Leaderboard is accessible at
https://github.com/SpeechColab/Leaderboard
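The abstract's points about scoring nuances and about normalizing TER can be illustrated with a small sketch. It is not the Leaderboard's code. The conventional TER below is the standard edit distance divided by reference length. The `symmetric_ter` function is only an assumed mTER-style variant (dividing by the longer of the two sequences), since the abstract does not state the mTER formula. The `normalize` helper is likewise hypothetical and covers only capitalization and punctuation.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists, with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference token
                curr[j - 1] + 1,         # insert a hypothesis token
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def ter(ref, hyp):
    """Conventional TER: edits / reference length (undefined for empty ref)."""
    return edit_distance(ref, hyp) / len(ref)


def symmetric_ter(ref, hyp):
    """Assumed mTER-style variant: edits / length of the longer sequence.

    This keeps the value in [0, 1] and gives the same result when
    reference and hypothesis are swapped.
    """
    longest = max(len(ref), len(hyp))
    return edit_distance(ref, hyp) / longest if longest else 0.0


def normalize(text):
    """Hypothetical normalizer: lowercase, drop punctuation except apostrophes."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


ref = "hello world".split()
hyp = "hello big wide wonderful world".split()

print(ter(ref, hyp))            # exceeds 1.0: three insertions over two reference tokens
print(symmetric_ter(ref, hyp))  # bounded by 1.0
print(symmetric_ter(hyp, ref))  # same value with the arguments swapped

# The same transcript pair scored with and without text normalization.
raw_ref, raw_hyp = "Okay, let's go.", "okay let's go"
print(ter(raw_ref.split(), raw_hyp.split()))
print(ter(normalize(raw_ref), normalize(raw_hyp)))
```

With plain TER, a hypothesis much longer than its reference can score above 100%, and swapping the two inputs changes the result. Dividing by the longer sequence avoids both effects. The last two lines show how a pipeline choice as small as stripping punctuation and case changes the reported error rate for identical audio.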
Related papers
- Automatic Speech Recognition System-Independent Word Error Rate Estimation [23.25173244408922]
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems.
In this paper, a hypothesis generation method for ASR System-Independent WER estimation is proposed.
arXiv Detail & Related papers (2024-04-25T16:57:05Z)
- End-to-End Evaluation for Low-Latency Simultaneous Speech Translation [55.525125193856084]
We propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions.
This includes the segmentation of the audio as well as the run-time of the different components.
We also compare different approaches to low-latency speech translation using this framework.
arXiv Detail & Related papers (2023-08-07T09:06:20Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- WER we are and WER we think we are [11.819335591315316]
We express skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets.
We compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and HUB'05 public benchmark.
We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.
arXiv Detail & Related papers (2020-10-07T14:20:31Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and the following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)