Earnings-21: A Practical Benchmark for ASR in the Wild
- URL: http://arxiv.org/abs/2104.11348v1
- Date: Thu, 22 Apr 2021 23:04:28 GMT
- Title: Earnings-21: A Practical Benchmark for ASR in the Wild
- Authors: Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang,
Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr
Zelasko, Miguel Jette
- Abstract summary: We present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors.
We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model.
Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage.
- Score: 4.091202801240259
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Commonly used speech corpora inadequately challenge academic and commercial
ASR systems. In particular, speech corpora lack metadata needed for detailed
analysis and WER measurement. In response, we present Earnings-21, a 39-hour
corpus of earnings calls containing entity-dense speech from nine different
financial sectors. This corpus is intended to benchmark ASR systems in the wild
with special attention towards named entity recognition. We benchmark four
commercial ASR models, two internal models built with open-source tools, and an
open-source LibriSpeech model and discuss their differences in performance on
Earnings-21. Using our recently released fstalign tool, we provide a candid
analysis of each model's recognition capabilities under different partitions.
Our analysis finds that ASR accuracy for certain NER categories is poor,
presenting a significant impediment to transcript comprehension and usage.
Earnings-21 bridges academic and commercial ASR system evaluation and enables
further research on entity modeling and WER on real world audio.
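The paper's headline metric is word error rate (WER). As a point of reference, here is a minimal Python sketch of how WER is computed from a reference/hypothesis pair via Levenshtein alignment over word tokens. It is illustrative only and is not the authors' fstalign tool, which they use to analyze recognition under different partitions; the `wer` function and the toy earnings-call strings are my own.

```python
# Minimal WER sketch: Levenshtein alignment over word tokens.
# Illustration only -- not the fstalign tool used in the paper.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical earnings-call snippet: 3 substitutions over 7 words ~ 0.43
print(wer("revenue grew nine percent at acme corp",
          "revenue grew non percent at acne core"))
```

The paper's per-category NER analysis additionally requires entity-tagged reference transcripts so that each error can be attributed to a category; a sketch like the one above yields only the aggregate rate.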
Related papers
- Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques [17.166092544686553]
  This study benchmarks Speech Emotion Recognition using ASR transcripts with varying Word Error Rates (WERs) from eleven models on three well-known corpora.
  We propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript.
  arXiv Detail & Related papers (2024-06-12T15:59:25Z)
- Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation [83.36685075570232]
  This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
  We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
  A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set.
  arXiv Detail & Related papers (2023-07-23T05:39:39Z)
- On the Efficacy and Noise-Robustness of Jointly Learned Speech Emotion and Automatic Speech Recognition [6.006652562747009]
  We investigate a joint ASR-SER learning approach in a low-resource setting.
  Joint learning can improve ASR word error rate (WER) and SER classification accuracy by 10.7% and 2.3%, respectively.
  Overall, the joint ASR-SER approach yielded more noise-resistant models than the independent ASR and SER approaches.
  arXiv Detail & Related papers (2023-05-21T18:52:21Z)
- End-to-End Speech Recognition: A Survey [68.35707678386949]
  The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
  All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
  arXiv Detail & Related papers (2023-03-03T01:46:41Z)
- Earnings-22: A Practical Benchmark for Accents in the Wild [0.8039067099377079]
  We present Earnings-22, a 125-file, 119-hour corpus of English-language earnings calls gathered from global companies.
  By examining Individual Word Error Rate (IWER), we find that key speech features impact model performance more for certain accents than others.
  arXiv Detail & Related papers (2022-03-29T14:02:57Z)
- Fusing ASR Outputs in Joint Training for Speech Emotion Recognition [14.35400087127149]
  We propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for jointly training Speech Emotion Recognition (SER).
  In joint ASR-SER training, incorporating both the ASR hidden and text outputs using a hierarchical co-attention fusion approach improves SER performance the most.
  We also present a novel word error rate analysis on IEMOCAP and a layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.
  arXiv Detail & Related papers (2021-10-29T11:21:17Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
  We focus on the general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
  We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR.
  arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
  We create transcripts from the original speech by applying three modern ASR systems.
  For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
  We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
  arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction [80.38130122127882]
  We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE).
  We use them to study representations learned by more than 40 combinations of encoder architectures and linguistic features trained on two datasets.
  We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance.
  arXiv Detail & Related papers (2020-04-17T09:17:40Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
  We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
  We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
  arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
  We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
  We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly (by 14% relative) with joint models trained using small amounts of in-domain data.
  arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.