A Data-Driven Investigation of Noise-Adaptive Utterance Generation with
Linguistic Modification
- URL: http://arxiv.org/abs/2210.10252v1
- Date: Wed, 19 Oct 2022 02:20:17 GMT
- Title: A Data-Driven Investigation of Noise-Adaptive Utterance Generation with
Linguistic Modification
- Authors: Anupama Chingacham, Vera Demberg, Dietrich Klakow
- Abstract summary: In noisy environments, speech can be hard to understand for humans.
We create a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing.
We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In noisy environments, speech can be hard to understand for humans. Spoken
dialog systems can help to enhance the intelligibility of their output, either
by modifying the speech synthesis (e.g., imitate Lombard speech) or by
optimizing the language generation. Here, we focus on the second type of
approach, in which an intended message is realized with words that are more
intelligible in a specific noisy environment. By conducting a speech perception
experiment, we created a dataset of 900 paraphrases in babble noise, perceived
by native English speakers with normal hearing. We find that careful selection
of paraphrases can improve intelligibility by 33% at SNR -5 dB. Our analysis of
the data shows that the intelligibility differences between paraphrases are
mainly driven by noise-robust acoustic cues. Furthermore, we propose an
intelligibility-aware paraphrase ranking model, which outperforms baseline
models with a relative improvement of 31.37% at SNR -5 dB.
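In the perception experiment, paraphrase recordings are mixed with multi-speaker babble noise at fixed signal-to-noise ratios such as SNR -5 dB. As a rough sketch of that kind of setup (not the authors' exact pipeline; the array names and single-channel assumption are illustrative), the noise can be rescaled so that a target SNR holds before mixing:

```python
# Minimal sketch: mix an utterance with babble noise at a target SNR.
# Assumes `speech` and `noise` are mono float numpy arrays at the same rate.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]                 # trim noise to utterance length
    p_speech = np.mean(speech ** 2)              # average speech power
    p_noise = np.mean(noise ** 2)                # average noise power
    # Solve 10*log10(p_speech / (g^2 * p_noise)) = snr_db for the noise gain g.
    g = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + g * noise

# The paper's hardest condition: babble noise at SNR -5 dB.
# noisy = mix_at_snr(utterance, babble, snr_db=-5.0)
```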
Related papers
- Speechworthy Instruction-tuned Language Models
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on several recent LLMs show that our approach achieves up to a 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
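Generative error correction, as summarized in the entry above, has an LLM map an ASR system's N-best hypotheses to a single corrected transcript; the paper then goes further and teaches the LLM to denoise. A schematic prompt-based version of the basic GER idea (the `complete` callable is a hypothetical stand-in for any instruction-tuned LLM API, not the paper's exact method):

```python
# Schematic generative error correction (GER): an LLM rewrites ASR output.
# `complete` is a hypothetical stand-in for any instruction-tuned LLM API.

def ger_prompt(nbest: list[str]) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are N-best hypotheses from a speech recognizer for one utterance.\n"
        "Infer the most likely true transcription and output only that text.\n"
        f"{hyps}\nTranscription:"
    )

def ger_correct(nbest: list[str], complete) -> str:
    return complete(ger_prompt(nbest)).strip()
```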
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the LibriSpeech dataset show 4.62%-9.26% relative WER improvements across different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
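Several entries in this list report relative WER improvements. The figure is the error reduction expressed as a fraction of the baseline error rate, as in this small helper (the function name is ours):

```python
def relative_wer_improvement(wer_baseline: float, wer_system: float) -> float:
    """Relative WER improvement in percent; e.g. 10.0 -> 9.0 gives 10.0 (%)."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline
```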
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
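The target-switching idea above can be written compactly: each view (original and noisy) must solve the wav2vec 2.0 contrastive task not only against its own quantized representations but also against the other view's, which pushes the two representations to agree. A conceptual sketch, not the released implementation; `model` and `contrastive_loss` are assumed to behave as in wav2vec 2.0 pre-training:

```python
def wav2vec_switch_loss(model, clean_wav, noisy_wav, contrastive_loss):
    # Contextual vectors c and quantized targets q at masked positions,
    # as in wav2vec 2.0 pre-training.
    c_clean, q_clean = model(clean_wav)
    c_noisy, q_noisy = model(noisy_wav)
    # Standard contrastive tasks on each view ...
    loss = contrastive_loss(c_clean, q_clean) + contrastive_loss(c_noisy, q_noisy)
    # ... plus the switched targets, so clean/noisy representations must agree.
    loss = loss + contrastive_loss(c_clean, q_noisy) + contrastive_loss(c_noisy, q_clean)
    return loss
```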
- Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors
Speech can be masked by noise, which may lead to word misperceptions on the side of the listener.
We propose an alternate solution of choosing noise-robust lexical paraphrases to represent an intended meaning.
We evaluate the intelligibility of synonyms in context and find that choosing the lexical unit that is less likely to be misheard than its synonym yields an average comprehension gain of 37% at SNR -5 dB and 21% at SNR 0 dB in babble noise.
arXiv Detail & Related papers (2021-07-18T01:16:33Z)
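This synonym-selection study and the main paper's intelligibility-aware paraphrase ranking share one decision rule: among meaning-equivalent candidates, output the one with the highest predicted intelligibility in the target noise condition. A schematic version, where `predict_intelligibility` is a hypothetical stand-in for a scorer trained on perception data such as the 900-paraphrase set:

```python
def rank_paraphrases(candidates: list[str], snr_db: float, predict_intelligibility):
    """Sort meaning-equivalent candidates by predicted intelligibility in noise."""
    return sorted(candidates,
                  key=lambda text: predict_intelligibility(text, snr_db),
                  reverse=True)

# best = rank_paraphrases(["turn on the lights", "switch on the lamps"],
#                         snr_db=-5.0, predict_intelligibility=scorer)[0]
```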
- Supervised Contrastive Learning for Accented Speech Recognition
We study the supervised contrastive learning framework for accented speech recognition.
We show that contrastive learning can improve accuracy by 3.66% (zero-shot) and 3.78% (full-shot) on average.
arXiv Detail & Related papers (2021-07-02T09:23:33Z)
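For reference, a minimal version of the supervised contrastive (SupCon) objective that the entry above builds on: positives for each anchor are the other samples sharing its label, however labels are defined in that paper's setup. An illustrative PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """Supervised contrastive loss over a batch of (N, D) embeddings."""
    z = F.normalize(embeddings, dim=1)                 # unit-norm embeddings
    sim = z @ z.t() / tau                              # scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-probability of positives per anchor, averaged over valid anchors.
    per_anchor = (log_prob.masked_fill(~pos_mask, 0.0).sum(1)
                  / pos_mask.sum(1).clamp(min=1))
    return -per_anchor[pos_mask.sum(1) > 0].mean()
```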
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
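The dynamic acoustic unit augmentation above builds on BPE-dropout: resampling the subword segmentation of each utterance during training so the model sees varied target-unit sequences. SentencePiece exposes this directly for BPE models (the model file name below is illustrative):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # hypothetical BPE model

def sample_units(text: str, dropout: float = 0.1) -> list[str]:
    # With a BPE model, enable_sampling applies BPE-dropout: each merge is
    # skipped with probability `alpha`, so segmentations vary per call.
    return sp.encode(text, out_type=str, enable_sampling=True, alpha=dropout)

# sample_units("gunaydin")  ->  e.g. ['▁gun', 'ay', 'din'] or ['▁gu', 'nay', 'din']
```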
- Data augmentation using prosody and false starts to recognize non-native children's speech
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
arXiv Detail & Related papers (2020-08-29T05:32:32Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build high-quality and stable seq2seq based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)