Perceptimatic: A human speech perception benchmark for unsupervised
subword modelling
- URL: http://arxiv.org/abs/2010.05961v1
- Date: Mon, 12 Oct 2020 18:40:08 GMT
- Title: Perceptimatic: A human speech perception benchmark for unsupervised
subword modelling
- Authors: Juliette Millet and Ewan Dunbar
- Abstract summary: We present a data set and methods to compare speech processing models and human behaviour on a phone discrimination task.
We provide Perceptimatic, an open data set which consists of French and English speech stimuli, as well as the results of 91 English- and 93 French-speaking listeners.
The stimuli test a wide range of French and English contrasts, and are extracted directly from corpora of natural running read speech.
We show that, unlike unsupervised models and supervised multilingual models, a standard supervised monolingual HMM-GMM phone recognition system, while good at discriminating phones, yields a representational space very different from that of human native listeners.
- Score: 11.646802225841153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a data set and methods to compare speech processing
models and human behaviour on a phone discrimination task. We provide
Perceptimatic, an open data set which consists of French and English speech
stimuli, as well as the results of 91 English- and 93 French-speaking
listeners. The stimuli test a wide range of French and English contrasts, and
are extracted directly from corpora of natural running read speech, used for
the 2017 Zero Resource Speech Challenge. We provide a method to compare humans'
perceptual space with models' representational space, and we apply it to models
previously submitted to the Challenge. We show that, unlike unsupervised models
and supervised multilingual models, a standard supervised monolingual HMM-GMM
phone recognition system, while good at discriminating phones, yields a
representational space very different from that of human native listeners.
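As a concrete illustration of the kind of comparison the abstract describes, the minimal sketch below scores a single ABX triplet from frame-level model representations using dynamic time warping over frame-wise cosine distances, a distance machinery commonly used in ABX evaluations of this kind. The feature matrices, the cosine distance, and the path normalisation are assumptions chosen for the example, not details taken from the paper.

```python
# Illustrative sketch (not the paper's exact code): score one ABX triplet from
# model representations. `a`, `b`, `x` are (frames, dims) feature matrices;
# A and X share a phone category, B differs. A positive delta means the model
# places X closer to A than to B, i.e. it discriminates the contrast correctly.
import numpy as np

def cosine_distance_matrix(u, v):
    """Pairwise cosine distances between the frames of two utterances."""
    u = u / (np.linalg.norm(u, axis=1, keepdims=True) + 1e-12)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-12)
    return 1.0 - u @ v.T

def dtw_distance(u, v):
    """Accumulated frame distance along the best DTW alignment path."""
    d = cosine_distance_matrix(u, v)
    n, m = d.shape
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = d[i - 1, j - 1] + min(cost[i - 1, j],
                                               cost[i, j - 1],
                                               cost[i - 1, j - 1])
    # Normalise by a path-length proxy so long stimuli are not penalised.
    return cost[n, m] / (n + m)

def abx_delta(a, b, x):
    """delta = d(X, B) - d(X, A); larger values predict easier discrimination."""
    return dtw_distance(x, b) - dtw_distance(x, a)
```

Per-triplet deltas computed this way can then be compared, item by item, against the listeners' responses on the same stimuli, which is the spirit of the method the paper applies to the Challenge submissions.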
Related papers
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
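The entry above selects queries to an informant with information-theoretic policies. The toy sketch below illustrates one such policy, expected information gain over a small hand-made set of candidate grammars and strings; the grammars, strings, and Bernoulli informant model are all hypothetical and much simpler than the paper's actual model.

```python
# Toy sketch of an information-theoretic query policy: pick the candidate word
# whose acceptability judgement is expected to shrink posterior entropy most.
# Grammars and candidate words are invented for illustration only.
import numpy as np

# Each "grammar" decides whether a string is phonotactically legal.
grammars = {
    "no_final_stop": lambda w: not w.endswith(("p", "t", "k")),
    "no_initial_ng": lambda w: not w.startswith("ng"),
    "anything_goes": lambda w: True,
}
posterior = {g: 1.0 / len(grammars) for g in grammars}   # uniform prior

def entropy(p):
    probs = np.array([v for v in p.values() if v > 0])
    return float(-(probs * np.log2(probs)).sum())

def expected_information_gain(word):
    """Expected drop in posterior entropy if we ask the informant about `word`."""
    gain = 0.0
    for answer in (True, False):
        # Probability the informant gives `answer`, marginalising over grammars.
        p_answer = sum(p for g, p in posterior.items()
                       if grammars[g](word) == answer)
        if p_answer == 0:
            continue
        updated = {g: (p / p_answer if grammars[g](word) == answer else 0.0)
                   for g, p in posterior.items()}
        gain += p_answer * (entropy(posterior) - entropy(updated))
    return gain

candidates = ["bla", "blik", "ngat", "trop"]
query = max(candidates, key=expected_information_gain)
# "bla" is legal under every grammar, so it is uninformative; the policy
# prefers a word on which the candidate grammars disagree.
```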
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
arXiv Detail & Related papers (2023-10-12T20:53:39Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but that, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Do self-supervised speech models develop human-like perception biases? [11.646802225841153]
We examine the representational spaces of three kinds of state-of-the-art self-supervised models: wav2vec 2.0, HuBERT and contrastive predictive coding (CPC).
We find that the CPC model shows a small native language effect, but that wav2vec 2.0 and HuBERT seem to develop a universal speech perception space which is not language-specific.
A comparison against the predictions of supervised phone recognisers suggests that all three self-supervised models capture relatively fine-grained perceptual phenomena, while supervised models are better at capturing coarser, phone-level effects of listeners' native language on perception.
arXiv Detail & Related papers (2022-05-31T14:21:40Z)
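To compare a self-supervised model such as those in the entry above against listener data, one first needs frame-level representations for each stimulus. The sketch below shows one way to obtain them with a public wav2vec 2.0 checkpoint from the Hugging Face transformers library; the checkpoint name, the file path, and the assumption of 16 kHz mono audio are illustrative choices, not details from the paper.

```python
# Illustrative only: extract frame-level wav2vec 2.0 features for one stimulus,
# assuming a 16 kHz mono WAV file and the public "facebook/wav2vec2-base"
# checkpoint (both are assumptions for this example).
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform, sr = sf.read("stimulus.wav")           # hypothetical file path
inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, frames, hidden_dim)

features = hidden.squeeze(0).numpy()             # usable in abx_delta(...) above
```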
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
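The entry above builds a generative LM over linguistic units such as syllables and phonemes. The sketch below is a minimal LSTM language model over phoneme IDs in PyTorch; the vocabulary size, layer dimensions, and single training step are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of a next-token LM over phoneme IDs (sizes are illustrative).
import torch
import torch.nn as nn

class PhonemeLSTMLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                       # logits for the next phoneme

vocab_size = 50                                   # e.g. phoneme inventory + markers
model = PhonemeLSTMLM(vocab_size)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of phoneme-ID sequences.
batch = torch.randint(0, vocab_size, (8, 20))
optimiser.zero_grad()
logits = model(batch[:, :-1])
loss = loss_fn(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
loss.backward()
optimiser.step()
```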
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
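The wav2vec 2.0 objective mentioned in the entry above is a contrastive task over masked latent representations. The simplified sketch below shows an InfoNCE-style loss of that general shape; it omits the quantisation, masking strategy, and other specifics of wav2vec 2.0 / XLSR, and the tensors are dummy stand-ins.

```python
# Simplified InfoNCE-style sketch, not the exact wav2vec 2.0 / XLSR loss:
# each masked position's context vector must identify its own target frame
# among randomly sampled distractor frames.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, num_negatives=10, temperature=0.1):
    """context, targets: (masked_positions, dim) tensors, row-aligned."""
    n = context.size(0)
    neg_idx = torch.randint(0, n, (n, num_negatives))            # distractor frames
    candidates = torch.cat([targets.unsqueeze(1), targets[neg_idx]], dim=1)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(n, dtype=torch.long)                    # true target at index 0
    return F.cross_entropy(sims, labels)

# Dummy usage: 32 masked frames with 256-dimensional representations.
loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```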
- The Perceptimatic English Benchmark for Speech Perception Models [11.646802225841153]
The benchmark consists of ABX stimuli along with the responses of 91 American English-speaking listeners.
We show that DeepSpeech, a standard English speech recognizer, is more specialized for English phoneme discrimination than English listeners are.
arXiv Detail & Related papers (2020-05-07T12:35:44Z)
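Once per-item model deltas (as sketched after the abstract above) and listener responses on the same ABX items are available, a simple way to summarise how human-like a model's discrimination pattern is, is to correlate the two. The example below uses a Spearman rank correlation on hypothetical per-item arrays; the benchmark papers themselves report additional analyses.

```python
# Hedged example: relate model deltas to listener behaviour on matched items.
# Both arrays are hypothetical and must be aligned on the same ABX triplets.
import numpy as np
from scipy.stats import spearmanr

model_deltas = np.array([0.42, 0.10, 0.75, -0.05, 0.31])    # model d(X,B) - d(X,A)
human_accuracy = np.array([0.91, 0.58, 0.97, 0.49, 0.83])   # share of correct listeners

rho, pval = spearmanr(model_deltas, human_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```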
This list is automatically generated from the titles and abstracts of the papers on this site.