Unsupervised feature learning for speech using correspondence and
Siamese networks
- URL: http://arxiv.org/abs/2003.12799v1
- Date: Sat, 28 Mar 2020 14:31:01 GMT
- Title: Unsupervised feature learning for speech using correspondence and
Siamese networks
- Authors: Petri-Johan Last, Herman A. Engelbrecht, Herman Kamper
- Abstract summary: We compare two recent methods for frame-level acoustic feature learning.
For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type.
For the correspondence autoencoder (CAE), matching frames are presented as input-output pairs.
For the first time, these feature extractors are compared on the same discrimination tasks using the same weak supervision pairs.
- Score: 24.22616495324351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In zero-resource settings where transcribed speech audio is unavailable,
unsupervised feature learning is essential for downstream speech processing
tasks. Here we compare two recent methods for frame-level acoustic feature
learning. For both methods, unsupervised term discovery is used to find pairs
of word examples of the same unknown type. Dynamic programming is then used to
align the feature frames between each word pair, serving as weak top-down
supervision for the two models. For the correspondence autoencoder (CAE),
matching frames are presented as input-output pairs. The Triamese network uses
a contrastive loss to reduce the distance between frames of the same predicted
word type while increasing the distance between negative examples. For the
first time, these feature extractors are compared on the same discrimination
tasks using the same weak supervision pairs. We find that, on the two datasets
considered here, the CAE outperforms the Triamese network. However, we show
that a new hybrid correspondence-Triamese approach (CTriamese) consistently
outperforms both the CAE and Triamese models in terms of average precision and
ABX error rates on both English and Xitsonga evaluation data.
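The abstract describes the two weak-supervision objectives concretely: dynamic programming (DTW) aligns the frames of two discovered word examples, the CAE then treats each aligned frame pair as an input-output pair, and the Triamese network applies a triplet-style contrastive loss over anchor, positive, and negative frames. A minimal numpy sketch of the alignment and loss follows; this is an illustration of the general technique, not the authors' implementation, and the function names and margin value are hypothetical:

```python
import numpy as np

def dtw_align(x, y):
    """Align two feature sequences (frames x dims) with dynamic time warping.
    Returns the (i, j) frame-index pairs on the minimum-cost path."""
    T1, T2 = len(x), len(y)
    # Pairwise Euclidean distances between frames of x and y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end of both sequences to recover the alignment path.
    path, i, j = [], T1, T2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def contrastive_frame_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style contrastive loss: pull aligned frames of the same
    predicted word type together, push a frame of another type away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)
```

For CAE training, each aligned pair (x[i], y[j]) along the path would serve as an input-output pair for reconstruction; for the Triamese objective, the anchor and positive are aligned frames from the same discovered word type and the negative is drawn from a different type.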
Related papers
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
Key problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of ...
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
- Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequence.
In the second stage, another HMM model is introduced to train from the generator's output, which boosts the performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR)
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z)
- A Comparison of Discrete Latent Variable Models for Speech Representation Learning [46.52258734975676]
This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal.
Results show that future time-step prediction with vq-wav2vec achieves better performance.
arXiv Detail & Related papers (2020-10-24T01:22:14Z)
- End-to-End Lip Synchronisation Based on Pattern Classification [15.851638021923875]
We propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream.
We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
arXiv Detail & Related papers (2020-05-18T11:42:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.