Content-Context Factorized Representations for Automated Speech Recognition
- URL: http://arxiv.org/abs/2205.09872v1
- Date: Thu, 19 May 2022 21:34:40 GMT
- Title: Content-Context Factorized Representations for Automated Speech Recognition
- Authors: David M. Chan, Shalini Ghosh
- Abstract summary: We introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations.
We demonstrate improved performance on standard ASR benchmarks, as well as improved performance in both real-world and artificially noisy ASR scenarios.
- Score: 12.618527387900079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may encode not only the spoken language content but also unnecessary context, such as background noise and sounds, speaker identity, accent, or protected attributes. This contextual information can directly harm generalization performance by introducing spurious correlations between the spoken words and the context in which they were spoken. In this work, we introduce an unsupervised, encoder-agnostic method for factoring speech-encoder representations into explicit content-encoding representations and spurious context-encoding representations. By doing so, we demonstrate improved performance on standard ASR benchmarks, as well as in both real-world and artificially noisy ASR scenarios.
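The abstract does not spell out how the factorization is realized; the PyTorch-style sketch below only illustrates the general idea, splitting a speech encoder's output into a content stream (fed to the ASR loss) and a context stream, with a simple decorrelation penalty standing in for whatever unsupervised objective the paper actually uses. All class and function names here are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: the factorization head and losses below are
# assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentContextFactorizer(nn.Module):
    """Splits a speech-encoder output into content and context streams."""

    def __init__(self, encoder_dim: int, content_dim: int, context_dim: int):
        super().__init__()
        self.to_content = nn.Linear(encoder_dim, content_dim)  # fed to the ASR head
        self.to_context = nn.Linear(encoder_dim, context_dim)  # absorbs nuisance info

    def forward(self, encoder_states: torch.Tensor):
        # encoder_states: (batch, time, encoder_dim) from any pretrained speech encoder
        content = self.to_content(encoder_states)
        context = self.to_context(encoder_states)
        return content, context

def decorrelation_penalty(content: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for an independence objective: penalize the
    cross-correlation between time-pooled content and context vectors."""
    c = F.normalize(content.mean(dim=1), dim=-1)   # (batch, content_dim)
    z = F.normalize(context.mean(dim=1), dim=-1)   # (batch, context_dim)
    cross = c.T @ z / c.size(0)                    # (content_dim, context_dim)
    return (cross ** 2).mean()

# Usage idea: only the content stream feeds the ASR objective (e.g. CTC), while the
# penalty discourages spoken-content information from leaking into the context stream:
# loss = ctc_loss(asr_head(content), targets, ...) + lam * decorrelation_penalty(content, context)
```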
Related papers
- Disentangling Textual and Acoustic Features of Neural Speech Representations [23.486891834252535]
We build upon the Information Bottleneck principle to propose a disentanglement framework for complex speech representations.
We apply our framework to emotion recognition and speaker identification downstream tasks.
arXiv Detail & Related papers (2024-10-03T22:48:04Z)
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning [6.363223418619587]
We introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy.
In evaluations on speech dialogues, our method outperforms baseline systems.
arXiv Detail & Related papers (2024-08-12T10:21:09Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
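The CTAP entry above describes two encoders that pull phoneme and speech representations into a joint space. A minimal sketch of such a dual-encoder contrastive objective (a symmetric InfoNCE-style loss over paired, pooled embeddings) might look like the following; the encoders themselves and the temperature value are placeholders, not details taken from the paper.

```python
# Minimal sketch of a dual-encoder contrastive (InfoNCE-style) objective;
# the phoneme/speech embeddings are assumed to come from two separate encoders.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(phoneme_emb: torch.Tensor,
                               speech_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Pull paired phoneme/speech embeddings together in a shared space."""
    p = F.normalize(phoneme_emb, dim=-1)          # (batch, dim)
    s = F.normalize(speech_emb, dim=-1)           # (batch, dim)
    logits = p @ s.T / temperature                # pairwise similarities
    targets = torch.arange(p.size(0), device=p.device)
    # Symmetric cross-entropy: each phoneme sequence matches its own utterance
    # and vice versa; all other pairs in the batch act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```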
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves performance comparable to supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR [25.75615870266786]
We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
arXiv Detail & Related papers (2022-07-03T13:32:24Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
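The VQMIVC entry above names two concrete ingredients: vector quantization for the content encoding and a mutual-information term that discourages correlation between the disentangled streams. The sketch below illustrates those ingredients in isolation; the straight-through VQ layer follows the standard VQ-VAE recipe, and a simple cosine-similarity penalty stands in for the paper's actual MI estimator, which is not reproduced here.

```python
# Rough sketch: straight-through vector quantization for a content stream, plus
# a simple correlation proxy standing in for the paper's mutual-information term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, code_dim: int, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (batch, time, code_dim) continuous content features
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)       # (batch*time, num_codes)
        codes = dists.argmin(dim=-1)
        q = self.codebook(codes).view_as(z)
        # Codebook + commitment losses, then a straight-through estimator
        # so gradients flow back to the encoder through the quantized output.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()
        return q, vq_loss

def mi_proxy(content: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
    """Placeholder for an MI penalty: squared cosine similarity between
    time-pooled content features and a speaker/context embedding.
    Assumes both embeddings share the same dimensionality."""
    c = F.normalize(content.mean(dim=1), dim=-1)   # (batch, dim)
    s = F.normalize(speaker, dim=-1)               # (batch, dim)
    return (c * s).sum(dim=-1).pow(2).mean()
```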
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.