What all do audio transformer models hear? Probing Acoustic
Representations for Language Delivery and its Structure
- URL: http://arxiv.org/abs/2101.00387v1
- Date: Sat, 2 Jan 2021 06:29:12 GMT
- Title: What all do audio transformer models hear? Probing Acoustic
Representations for Language Delivery and its Structure
- Authors: Jui Shah, Yaman Kumar Singla, Changyou Chen, Rajiv Ratn Shah
- Abstract summary: We compare the audio transformer models Mockingjay and wav2vec2.0.
We probe the audio models' understanding of textual surface, syntax, and semantic features.
We do this over exhaustive settings for native, non-native, synthetic, read and spontaneous speech datasets.
- Score: 64.54208910952651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent times, BERT-based transformer models have become an
inseparable part of the 'tech stack' of text processing models. Similar
progress is being observed in the speech domain, with a multitude of models
achieving state-of-the-art results by using audio transformer models to
encode speech. This raises the question of what these audio transformer
models are learning. Moreover, although the standard methodology is to
choose the last-layer embedding for any downstream task, is it the optimal
choice? We try to answer these questions for two recent audio transformer
models, Mockingjay and wav2vec2.0. We compare them on a comprehensive set of
language delivery and structure features, including audio, fluency, and
pronunciation features. Additionally, we probe the audio models'
understanding of textual surface, syntax, and semantic features and compare
them to BERT. We do this over exhaustive settings for native, non-native,
synthetic, read, and spontaneous speech datasets.
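The layer-wise probing pipeline the abstract describes can be reproduced at small scale with public checkpoints. The sketch below assumes the HuggingFace facebook/wav2vec2-base checkpoint, mean-pooled utterance embeddings, and a logistic-regression probe; these specific choices are illustrative stand-ins, not the paper's exact protocol.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Public checkpoint used for illustration; the paper probes wav2vec2.0 and Mockingjay.
NAME = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(NAME)
model = Wav2Vec2Model.from_pretrained(NAME).eval()

def layerwise_embeddings(waveform, sr=16000):
    """Mean-pooled embedding of one utterance, for every transformer layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: one (1, time, dim) tensor per layer (incl. the input embedding)
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def probe_each_layer(waveforms, labels):
    """Fit a linear probe per layer; higher accuracy means the target
    feature is more decodable from that layer's representation."""
    per_utt = [layerwise_embeddings(w) for w in waveforms]
    scores = []
    for layer in range(len(per_utt[0])):
        X = np.stack([utt[layer] for utt in per_utt])
        clf = LogisticRegression(max_iter=1000)
        scores.append(cross_val_score(clf, X, labels, cv=3).mean())
    return scores  # compare layers instead of defaulting to the last one
```

Comparing the per-layer scores speaks directly to the abstract's second question: if an intermediate layer probes better than the final one, the default of taking the last-layer embedding is not optimal.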
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process (a generic sketch follows this entry).
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
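TransVIP's implementation is not shown here; the following is only a generic PyTorch sketch of the dual-encoder conditioning the summary mentions, with one encoder for voice characteristics and one for timing (isochrony). All module names, input features, and dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class DualConditionS2ST(nn.Module):
    """Toy decoder conditioned on two separate encodings of the source speech:
    one for voice characteristics, one for timing (isochrony). Illustrative
    only; not TransVIP's actual architecture."""
    def __init__(self, dim=256, vocab=1024):
        super().__init__()
        self.voice_enc = nn.GRU(input_size=80, hidden_size=dim, batch_first=True)  # e.g. mel frames
        self.timing_enc = nn.GRU(input_size=1, hidden_size=dim, batch_first=True)  # e.g. per-frame durations
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.embed = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tgt_tokens, mels, durations):
        v, _ = self.voice_enc(mels)        # (B, T_mel, dim): voice conditioning
        t, _ = self.timing_enc(durations)  # (B, T_dur, dim): isochrony conditioning
        memory = torch.cat([v, t], dim=1)  # decoder cross-attends to both encodings
        # Causal target mask omitted for brevity.
        h = self.decoder(self.embed(tgt_tokens), memory)
        return self.out(h)                 # next-token logits for the target units
```

Keeping the two conditioning streams in separate encoders, rather than mixing them in one, is what lets each property be preserved (or manipulated) independently.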
- VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild [42.788845796159045]
We introduce VoiceCraft, a token-infilling neural language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech tasks.
On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness.
For zero-shot TTS, our model outperforms prior state-of-the-art models, including VALL-E and the popular commercial model XTTS-v2.
arXiv Detail & Related papers (2024-03-25T17:38:32Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample (a minimal sketch of this cascade follows the entry).
Our approach was declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
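The cascade described above (transcribe with ASR, then pair the transcript with the audio) can be approximated with standard open models. A minimal sketch, assuming Whisper for ASR and BERT for the text side; the concatenation-based fusion is a placeholder, not the paper's cascaded cross-modal transformer.

```python
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

# Stand-in models: Whisper for ASR, BERT for the text side (not the paper's exact choices).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def audio_textual_features(audio_path, audio_embedding):
    """Transcribe the audio, encode the transcript, and fuse with an audio embedding.
    `audio_embedding` is a precomputed vector from any audio encoder (placeholder)."""
    text = asr(audio_path)["text"]                               # 1) ASR transcription
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        text_emb = bert(**enc).last_hidden_state.mean(dim=1)[0]  # 2) text representation
    return torch.cat([text_emb, audio_embedding])                # 3) naive multimodal fusion
```

The joint vector can then feed any downstream classifier, so errors in the transcript are compensated by the acoustic side and vice versa.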
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models (a toy sketch of a shared text-audio token space follows this entry).
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
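One common way to fuse text and speech language models, and the one AudioPaLM's abstract points to, is a single decoder vocabulary containing both text tokens and discretized audio tokens, so one autoregressive model can read and write either modality. Below is a toy sketch of that token-space union; the vocabulary sizes and offset scheme are assumptions, not AudioPaLM's actual configuration.

```python
# Illustrative sizes only; not AudioPaLM's actual vocabulary or audio tokenizer.
TEXT_VOCAB = 32_000   # e.g. a SentencePiece text vocabulary
AUDIO_CODES = 1_024   # e.g. discrete units from an audio tokenizer

def audio_to_token(code: int) -> int:
    """Map a discrete audio unit into the shared vocabulary, after the text ids."""
    assert 0 <= code < AUDIO_CODES
    return TEXT_VOCAB + code

def is_audio_token(token_id: int) -> bool:
    return token_id >= TEXT_VOCAB

# One flat sequence can then mix modalities for a single autoregressive decoder,
# e.g. text prompt ids followed by source-speech audio tokens, with the model
# trained to continue with target tokens (text for ASR, audio tokens for
# speech-to-speech translation).
```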
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation [55.1650189699753]
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.
Current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech.
We present AV-TranSpeech, the first audio-visual speech-to-speech translation model that does not rely on intermediate text.
arXiv Detail & Related papers (2023-05-24T17:59:03Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voice.
GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.