A study on native American English speech recognition by Indian
listeners with varying word familiarity level
- URL: http://arxiv.org/abs/2112.04151v1
- Date: Wed, 8 Dec 2021 07:43:38 GMT
- Title: A study on native American English speech recognition by Indian
listeners with varying word familiarity level
- Authors: Abhayjeet Singh, Achuth Rao MV, Rakesh Vaideeswaran, Chiranjeevi
Yarra, Prasanta Kumar Ghosh
- Abstract summary: We have three kinds of responses from each listener while they recognize an utterance.
From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences.
Speaker nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those from other nativities.
- Score: 62.14295630922855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, listeners of varied Indian nativities are asked to listen and
recognize TIMIT utterances spoken by American speakers. We have three kinds of
responses from each listener while they recognize an utterance: 1. Sentence
difficulty ratings, 2. Speaker difficulty ratings, and 3. Transcription of the
utterance. From these transcriptions, word error rate (WER) is calculated and
used as a metric to evaluate the similarity between the recognized and the
original sentences. The sentences selected in this study are categorized into
three groups: Easy, Medium and Hard, based on the frequency of occurrence of the
words in them. We observe that the sentence difficulty ratings, the speaker
difficulty ratings and the WERs increase from the easy to the hard category of
sentences. We also compare the human speech recognition (HSR) performance with
that of three automatic speech recognition (ASR) systems under the following
three combinations of acoustic model (AM) and language model (LM): ASR1) AM
trained with recordings from speakers of Indian origin and LM built on TIMIT
text, ASR2) AM using recordings from native American speakers and LM built on
text from the LIBRI speech corpus, and ASR3) AM using recordings from native
American speakers and LM built on LIBRI speech and TIMIT text. We observe that
the HSR performance is similar to that of ASR1, whereas ASR3 achieves the best
performance. Speaker nativity-wise analysis shows that utterances from speakers
of some nativities are more difficult for Indian listeners to recognize than
those from speakers of other nativities.
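To make the reported WER figures concrete, below is a minimal sketch of how a word error rate could be computed between an original TIMIT sentence and a listener's transcription. The function name, the lowercasing/whitespace normalization, and the example hypothesis are illustrative assumptions, not the authors' implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # Reference is the TIMIT SA1 sentence; the hypothesis is a hypothetical listener transcription.
    ref = "she had your dark suit in greasy wash water all year"
    hyp = "she had your dark suit in greasy wash water all ear"
    print(f"WER = {word_error_rate(ref, hyp):.2%}")  # one substitution out of 11 words -> ~9.1%
```

The same routine could be applied to ASR hypotheses so that human and machine transcriptions are scored with an identical metric.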
Related papers
- Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue [41.10328851671422]
SpeechLLMs have demonstrated impressive spoken dialog question-answering (SQA) performance in benchmarks like Gaokao.
We show that SpeechLLMs exhibit limited speaker awareness from the audio and behave similarly to an LLM reasoning from the conversation transcription without sound.
We propose that tasks focused on identity-critical questions could offer a more accurate evaluation framework of SpeechLLMs in SQA.
arXiv Detail & Related papers (2024-09-07T22:54:47Z)
- You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish [0.5249805590164903]
We focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services.
We compare the recognition results using Word Error Rate and analyze the linguistic factors that may generate the observed transcription errors.
arXiv Detail & Related papers (2024-05-22T06:24:55Z)
- A Deep Dive into the Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos [4.809236881780707]
We describe the curation of a massive speech dataset of 8740 hours consisting of ~9.8K technical lectures in the English language along with their transcripts, delivered by instructors representing various parts of Indian demography.
We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India.
arXiv Detail & Related papers (2023-07-20T05:03:00Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition [80.87085897419982]
We propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM.
Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously.
The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
arXiv Detail & Related papers (2022-05-06T06:07:09Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Do We Still Need Automatic Speech Recognition for Spoken Language Understanding? [14.575551366682872]
We show that learned speech features are superior to ASR transcripts on three classification tasks.
We highlight the intrinsic robustness of wav2vec 2.0 representations to out-of-vocabulary words as key to better performance.
arXiv Detail & Related papers (2021-11-29T15:13:36Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing.
Recent works also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- The Perceptimatic English Benchmark for Speech Perception Models [11.646802225841153]
The benchmark consists of ABX stimuli along with the responses of 91 American English-speaking listeners.
We show that DeepSpeech, a standard English speech recognizer, is more specialized on English phoneme discrimination than English listeners.
arXiv Detail & Related papers (2020-05-07T12:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.