Related papers: We Augmented Whisper With kNN and You Won't Believe What Came Next

URL: http://arxiv.org/abs/2410.18850v1
Date: Thu, 24 Oct 2024 15:32:52 GMT
Title: We Augmented Whisper With kNN and You Won't Believe What Came Next
Authors: Maya K. Nachesa, Vlad Niculae
Abstract summary: We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
Score: 10.174848090916669
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that can instead adapt by building an external datastore that is searched at inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
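The abstract describes the datastore lookup only in words, so a rough sketch of the general idea may help. The code below follows the generic $k$NN decoding recipe from the NLG/MT literature, not necessarily the authors' setup: the datastore pairs decoder hidden states (keys) with the tokens that followed them (values), and at each decoding step the nearest keys are converted into a token distribution that is mixed with the model's own prediction. The function names, the squared-L2 distance, the temperature, the interpolation weight `lam`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def knn_probs(query, keys, values, vocab_size, k=3, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens."""
    # Squared L2 distance from the query vector to every stored key.
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(keys))
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances: closer neighbors get more weight.
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    probs = np.zeros(vocab_size)
    # Neighbors that share a target token pool their weight.
    np.add.at(probs, values[nearest], weights)
    return probs


def interpolate(model_probs, neighbor_probs, lam=0.5):
    """Mix the model's next-token distribution with the kNN distribution."""
    return lam * neighbor_probs + (1.0 - lam) * model_probs


# Toy datastore (hypothetical): 2-d "hidden states" and the token ids that followed them.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
values = np.array([2, 2, 1, 0])

query = np.array([0.1, 0.0])             # hidden state at the current decoding step
model_probs = np.array([0.6, 0.3, 0.1])  # the base model prefers token 0

neighbor_probs = knn_probs(query, keys, values, vocab_size=3)
mixed = interpolate(model_probs, neighbor_probs, lam=0.5)
print(neighbor_probs, mixed)
```

In this toy case the nearest neighbors mostly vote for token 2, so the mixed distribution shifts toward it even though the base model alone prefers token 0. The model's weights are never updated; adapting to a new speaker or domain would only mean adding entries to `keys` and `values`.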

Related papers

Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce Generative Pre-trained Speech Transformer (GPST). GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascade fashion. We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences. We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data. We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR). We fine-tune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages. With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
Deep Learning Based Assessment of Synthetic Speech Naturalness [14.463987018380468]
We present a new objective prediction model for synthetic speech naturalness. It can be used to evaluate Text-To-Speech or Voice Conversion systems.
arXiv Detail & Related papers (2021-04-23T16:05:20Z)
Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement [1.7704011486040843]
We show that one can transfer an existing TTS model for new speakers from the same or a different language using only 20 minutes of data. We first introduce a base multi-lingual Tacotron with language-agnostic input, then demonstrate how transfer learning is done for different scenarios of speaker adaptation.
arXiv Detail & Related papers (2020-11-12T14:05:34Z)
End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach. We investigate different features such as Mel frequency cepstral coefficients and smoothed spectral features. The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection is an essential first step for a variety of speech processing applications. We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.