Speaker embeddings by modeling channel-wise correlations
- URL: http://arxiv.org/abs/2104.02571v1
- Date: Tue, 6 Apr 2021 15:10:14 GMT
- Title: Speaker embeddings by modeling channel-wise correlations
- Authors: Themos Stafylakis, Johan Rohdin, Lukas Burget
- Abstract summary: We propose an alternative pooling method, where pairwise correlations between channels for given frequencies are used as statistics.
The method is inspired by style-transfer methods in computer vision, where the style of an image, modeled by the matrix of channel-wise correlations, is transferred to another image.
By drawing analogies between image style and speaker characteristics, and between image content and phonetic sequence, we explore the use of such channel-wise correlation features to train a ResNet architecture.
- Score: 16.263418635038747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker embeddings extracted with deep 2D convolutional neural networks are
typically modeled as projections of first and second order statistics of
channel-frequency pairs onto a linear layer, using either average or attentive
pooling along the time axis. In this paper we examine an alternative pooling
method, where pairwise correlations between channels for given frequencies are
used as statistics. The method is inspired by style-transfer methods in
computer vision, where the style of an image, modeled by the matrix of
channel-wise correlations, is transferred to another image, in order to produce
a new image having the style of the first and the content of the second. By
drawing analogies between image style and speaker characteristics, and between
image content and phonetic sequence, we explore the use of such channel-wise
correlation features to train a ResNet architecture in an end-to-end fashion.
Our experiments on VoxCeleb demonstrate the effectiveness of the proposed
pooling method in speaker recognition.
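The pooling idea is concrete enough to sketch. Below is a minimal PyTorch sketch of correlation pooling over a (batch, channels, frequency, time) feature map, assuming the Gram-style statistic from style transfer; the upper-triangle vectorization, the 32-channel sizes, and the final linear layer are illustrative assumptions, not the authors' reference implementation.

```python
import torch


def correlation_pooling(feats: torch.Tensor) -> torch.Tensor:
    """Pool a (B, C, F, T) map into per-frequency channel correlations.

    For each frequency bin, the T frames act as observations and the C x C
    matrix of pairwise channel correlations is computed, a Gram-style
    statistic as in style transfer. The matrix is symmetric, so only the
    upper triangle is kept.
    """
    B, C, F, T = feats.shape
    x = feats - feats.mean(dim=-1, keepdim=True)           # center over time
    x = x.permute(0, 2, 1, 3)                              # (B, F, C, T)
    cov = x @ x.transpose(-1, -2) / (T - 1)                # (B, F, C, C)
    std = x.std(dim=-1).clamp_min(1e-8)                    # (B, F, C)
    corr = cov / (std.unsqueeze(-1) * std.unsqueeze(-2))   # correlations
    iu, ju = torch.triu_indices(C, C)
    return corr[..., iu, ju].flatten(1)                    # (B, F*C*(C+1)/2)


# Example: a ResNet-like output map, pooled and projected to an embedding.
feats = torch.randn(2, 32, 10, 150)                        # (B, C, F, T)
stats = correlation_pooling(feats)                         # (2, 5280)
embedding = torch.nn.Linear(stats.shape[1], 256)(stats)    # speaker embedding
```

Unlike average or attentive pooling, these statistics discard the temporal order entirely, which matches the style/content analogy: the correlations capture "how" the speaker sounds rather than "what" is said.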
Related papers
- Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation [60.27691946892796]
We present a method for generating video sequences with coherent motion between a pair of input key frames.
Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
arXiv Detail & Related papers (2024-08-27T17:57:14Z)
- Explicit Correspondence Matching for Generalizable Neural Radiance Fields [49.49773108695526]
We present a new NeRF method that is able to generalize to new unseen scenarios and perform novel view synthesis with as few as two source views.
The explicit correspondence matching is quantified with the cosine similarity between image features sampled at the 2D projections of a 3D point on different views.
Our method achieves state-of-the-art results on different evaluation settings, with the experiments showing a strong correlation between our learned cosine feature similarity and volume density.
arXiv Detail & Related papers (2023-04-24T17:46:01Z)
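The matching quantity in this entry is simple to illustrate. A hedged PyTorch sketch, assuming bilinear sampling of per-view feature maps at already-projected 2D points in grid_sample's [-1, 1] convention; the shapes and helpers are placeholders, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F


def matched_cosine_similarity(feat_a, feat_b, uv_a, uv_b):
    """feat_*: (1, C, H, W) feature maps; uv_*: (1, N, 2) projections in [-1, 1].

    Bilinearly samples a feature vector per view at each projected point and
    returns their cosine similarity, one score per 3D point.
    """
    # grid_sample expects (N, H_out, W_out, 2); use a 1 x N "image" of points.
    fa = F.grid_sample(feat_a, uv_a.unsqueeze(1), align_corners=True)  # (1, C, 1, N)
    fb = F.grid_sample(feat_b, uv_b.unsqueeze(1), align_corners=True)
    return F.cosine_similarity(fa.squeeze(2), fb.squeeze(2), dim=1)    # (1, N)


# Example: 64-channel feature maps and 5 projected points per view.
sim = matched_cosine_similarity(
    torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
    torch.rand(1, 5, 2) * 2 - 1, torch.rand(1, 5, 2) * 2 - 1,
)
print(sim.shape)  # torch.Size([1, 5])
```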
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
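The hybrid CTC/attention objective this entry mentions is a standard pattern worth sketching. In the hedged PyTorch sketch below, a shared encoder feeds both a CTC head and an attention decoder, and the two losses are interpolated; the 0.3/0.7 weights, module sizes, and the omitted target shifting are illustrative assumptions.

```python
import torch
import torch.nn as nn

vocab, blank = 32, 0
encoder = nn.LSTM(input_size=80, hidden_size=64, batch_first=True)
ctc_head = nn.Linear(64, vocab)
decoder = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
dec_head = nn.Linear(64, vocab)

feats = torch.randn(2, 100, 80)               # (batch, frames, features)
targets = torch.randint(1, vocab, (2, 12))    # label sequences (no blanks)

enc, _ = encoder(feats)                                    # (2, 100, 64)
log_probs = ctc_head(enc).log_softmax(-1).transpose(0, 1)  # (T, B, vocab)
loss_ctc = nn.CTCLoss(blank=blank)(
    log_probs, targets, torch.full((2,), 100), torch.full((2,), 12))

# Attention decoder consumes embedded targets (teacher forcing; the usual
# BOS shift and causal mask are omitted in this sketch) and attends over
# the shared encoder output.
tgt_emb = nn.Embedding(vocab, 64)(targets)
dec_out = dec_head(decoder(tgt_emb, enc))                  # (2, 12, vocab)
loss_att = nn.CrossEntropyLoss()(dec_out.reshape(-1, vocab),
                                 targets.reshape(-1))

loss = 0.3 * loss_ctc + 0.7 * loss_att   # interpolated hybrid objective
print(float(loss))
```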
- Decoupled Mixup for Generalized Visual Recognition [71.13734761715472]
We propose a novel "Decoupled-Mixup" method to train CNN models for visual recognition.
Our method decouples each image into discriminative and noise-prone regions, and then heterogeneously combines these regions to train CNN models.
Experiment results show the high generalization performance of our method on testing data that are composed of unseen contexts.
arXiv Detail & Related papers (2022-10-26T15:21:39Z)
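As a rough illustration of the region-wise combination described here: a mask marks one image's discriminative region and the remaining, noise-prone area is filled from another image. How the mask is obtained (a fixed box below) is purely an assumption; the paper's region decoupling is not specified in this summary.

```python
import torch


def decoupled_mix(img_a, img_b, mask_a):
    """mask_a: 1 inside img_a's discriminative region, 0 elsewhere.

    Keeps img_a's discriminative region and fills the remaining, noise-prone
    area with content from img_b (a CutMix-style heterogeneous combination).
    """
    return mask_a * img_a + (1 - mask_a) * img_b


imgs = torch.randn(2, 3, 32, 32)
mask = torch.zeros(1, 1, 32, 32)
mask[..., 8:24, 8:24] = 1.0          # pretend this box is discriminative
mixed = decoupled_mix(imgs[0:1], imgs[1:2], mask)
print(mixed.shape)                   # torch.Size([1, 3, 32, 32])
```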
- Action Recognition with Domain Invariant Features of Skeleton Image [25.519217340328442]
We propose a novel CNN-based method with adversarial training for action recognition.
We introduce a two-level domain adversarial learning to align the features of skeleton images from different view angles or subjects.
It achieves competitive results compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-11-19T08:05:54Z)
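Domain-adversarial alignment of this kind is commonly built on a gradient-reversal layer, sketched below in PyTorch: the domain head (view angle or subject) trains normally while the reversed gradient pushes the feature extractor toward domain-invariant features. This is the generic GRL pattern, not the paper's exact two-level architecture.

```python
import torch


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Identity in the forward pass, negated (scaled) gradient backward.
        return -ctx.lam * grad_out, None


features = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())
domain_head = torch.nn.Linear(64, 5)      # e.g. 5 camera view angles

x = torch.randn(8, 128, requires_grad=True)
h = features(x)
domain_logits = domain_head(GradReverse.apply(h, 1.0))
# Minimizing this loss trains the domain head while, through the reversed
# gradient, training `features` to confuse it.
loss = torch.nn.functional.cross_entropy(domain_logits,
                                         torch.randint(0, 5, (8,)))
loss.backward()
```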
- Attention-based Neural Beamforming Layers for Multi-channel Speech Recognition [17.009051842682677]
We propose a 2D Conv-Attention module which combines convolutional neural networks with attention for beamforming.
We apply self- and cross-attention to explicitly model the correlations within and between the input channels.
The results show a relative improvement of 3.8% in WER by the proposed model over the baseline neural beamformer.
arXiv Detail & Related papers (2021-05-12T19:32:24Z)
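The within/between-channel attention split can be sketched with stock PyTorch attention modules: self-attention over one channel's frames models correlations within a channel, and cross-attention against another channel models correlations between channels. Module sizes and the additive fusion are assumptions, not the proposed 2D Conv-Attention module.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Two-channel input: (batch, time, features) per microphone.
ref = torch.randn(2, 50, d_model)
aux = torch.randn(2, 50, d_model)

within, _ = self_attn(ref, ref, ref)        # correlations within the channel
between, _ = cross_attn(ref, aux, aux)      # correlations between channels
enhanced = within + between                 # naive fusion for the sketch
print(enhanced.shape)                       # torch.Size([2, 50, 64])
```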
- Set Based Stochastic Subsampling [85.5331107565578]
We propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network.
We show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification.
arXiv Detail & Related papers (2020-06-25T07:36:47Z)
- End-to-End Lip Synchronisation Based on Pattern Classification [15.851638021923875]
We propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream.
We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
arXiv Detail & Related papers (2020-05-18T11:42:32Z)
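One way to read "pattern classification" here is scoring a set of candidate audio-video offsets and picking the best, sketched below with cosine similarity of per-frame embeddings; the encoders, offset grid, and scoring rule are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn.functional as F

T, D = 40, 64
audio = torch.randn(1, T, D)   # per-frame audio embeddings
video = torch.randn(1, T, D)   # per-frame video embeddings
offsets = range(-5, 6)         # candidate offsets in frames

scores = []
for k in offsets:
    # Align the two streams at offset k and score the overlap similarity.
    a = audio[:, max(k, 0):T + min(k, 0)]
    v = video[:, max(-k, 0):T + min(-k, 0)]
    scores.append(F.cosine_similarity(a, v, dim=-1).mean(dim=1))
logits = torch.stack(scores, dim=1)                 # (1, num_offsets)
pred = list(offsets)[int(logits.argmax())]          # predicted AV offset
print(pred)
```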
- Channel Interaction Networks for Fine-Grained Image Categorization [61.095320862647476]
Fine-grained image categorization is challenging due to the subtle inter-class differences.
We propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images.
Our model can be trained efficiently in an end-to-end fashion without the need of multi-stage training and testing.
arXiv Detail & Related papers (2020-03-11T11:51:51Z)
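Channel-wise interplay within an image can be sketched as a channel-affinity graph: each channel is reweighted by a softmax over its inner products with the other channels. The residual formulation below is an assumption, not the paper's CIN.

```python
import torch
import torch.nn.functional as F

feats = torch.randn(2, 64, 14, 14)            # (batch, channels, H, W)
x = feats.flatten(2)                          # (B, C, H*W)
affinity = F.softmax(x @ x.transpose(1, 2), dim=-1)   # (B, C, C) channel graph
interacted = (affinity @ x).view_as(feats)    # each channel mixes the others
out = feats + interacted                      # residual channel interaction
print(out.shape)                              # torch.Size([2, 64, 14, 14])
```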