Ordered and Binary Speaker Embedding
- URL: http://arxiv.org/abs/2305.16043v1
- Date: Thu, 25 May 2023 13:21:00 GMT
- Title: Ordered and Binary Speaker Embedding
- Authors: Jiaying Wang, Xianglong Wang, Namin Wang, Lantian Li, Dong Wang
- Abstract summary: We propose an ordered binary embedding approach that sorts the dimensions of the embedding vector via a nested dropout and converts the sorted vectors to binary codes via Bernoulli sampling.
The resultant ordered binary codes offer some important merits such as hierarchical clustering, reduced memory usage, and fast retrieval.
- Score: 12.22202088781098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern speaker recognition systems represent utterances by embedding vectors. Conventional embedding vectors are dense and non-structural. In this paper, we propose an ordered binary embedding approach that sorts the dimensions of the embedding vector via a nested dropout and converts the sorted vectors to binary codes via Bernoulli sampling. The resultant ordered binary codes offer some important merits such as hierarchical clustering, reduced memory usage, and fast retrieval. These merits were empirically verified by comprehensive experiments on a speaker identification task with the VoxCeleb and CN-Celeb datasets.
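
As an illustration of the two ingredients the abstract names, the sketch below shows nested dropout followed by Bernoulli-sampled binarization in PyTorch. It is a minimal reconstruction from the abstract alone, not the authors' code; the geometric truncation distribution, the sigmoid squashing, and the straight-through estimator are common choices assumed here for concreteness.

```python
# Minimal sketch (assumptions noted above) of nested dropout + Bernoulli
# binarization for speaker embeddings. Not the authors' implementation.
import torch


def nested_dropout(x: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Zero out all dimensions after a randomly sampled truncation index.

    Sampling the index from a geometric distribution (success prob. p)
    lets early dimensions survive far more often, forcing them to carry
    the coarsest, most important information -- this is what orders the
    dimensions.
    """
    d = x.size(-1)
    # One truncation index per example in the batch.
    k = torch.distributions.Geometric(probs=p).sample((x.size(0),)).long()
    k = k.clamp(max=d - 1)
    mask = (torch.arange(d, device=x.device).unsqueeze(0) <= k.unsqueeze(1)).float()
    return x * mask


def bernoulli_binarize(x: torch.Tensor) -> torch.Tensor:
    """Convert real-valued dimensions to {0, 1} codes by Bernoulli sampling.

    A straight-through estimator passes gradients to the underlying
    probabilities so the encoder stays trainable end to end.
    """
    probs = torch.sigmoid(x)
    codes = torch.bernoulli(probs)
    # Forward value is `codes`; backward gradient flows through `probs`.
    return codes + probs - probs.detach()


# Toy usage: a batch of 4 "speaker embeddings" with 16 dimensions.
emb = torch.randn(4, 16, requires_grad=True)
binary = bernoulli_binarize(nested_dropout(emb))
print(binary)
```

At inference time one would typically threshold the probabilities at 0.5 rather than sample, so that each utterance maps to a deterministic code.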
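The claimed merits also admit a simple illustration. Because the dimensions are ordered by importance, a short prefix of each binary code already acts as a coarse cluster identifier, and full codes can be compared with cheap XOR-and-popcount Hamming distances. The NumPy sketch below is again an assumption-laden illustration, not the paper's retrieval pipeline.

```python
# Illustrative sketch of the merits listed above. Packing 256-bit codes uses
# 32 bytes per speaker instead of 1 KB for a 256-dim float32 embedding
# (reduced memory); XOR + popcount gives fast Hamming retrieval; and since
# dimensions are ordered by importance, matching on a short code prefix acts
# as a coarse, hierarchical clustering stage.
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(10_000, 256), dtype=np.uint8)  # 10k speakers
packed = np.packbits(codes, axis=1)            # 256 bits -> 32 bytes per code

# Query: a stored code with a few low-importance (late) bits flipped,
# simulating a noisy re-recording of speaker 123.
query = codes[123].copy()
query[200:208] ^= 1
q = np.packbits(query)

# Coarse stage: keep candidates whose first 32 bits (the most important
# dimensions) lie within a small Hamming radius of the query prefix.
prefix_dist = np.unpackbits(packed[:, :4] ^ q[:4], axis=1).sum(axis=1)
candidates = np.flatnonzero(prefix_dist <= 4)

# Fine stage: full 256-bit Hamming distance, only on the survivors.
full_dist = np.unpackbits(packed[candidates] ^ q, axis=1).sum(axis=1)
print("nearest speakers:", candidates[np.argsort(full_dist)[:5]])
```

The prefix filter here is a crude stand-in for the hierarchical clustering that the dimension ordering enables: truncating an ordered code never discards its most informative bits.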
Related papers
- Sequence Shortening for Context-Aware Machine Translation [5.803309695504831]
We show that a special case of the multi-encoder architecture achieves higher accuracy on contrastive datasets.
We introduce two novel methods, Latent Grouping and Latent Selecting, in which the network learns to group tokens or select the tokens to be cached as context.
arXiv Detail & Related papers (2024-02-02T13:55:37Z)
- Emergence of Latent Binary Encoding in Deep Neural Network Classifiers [0.0]
We investigate the emergence of binary encoding within the latent space of deep-neural-network classifiers.
By analyzing several datasets of increasing complexity, we provide empirical evidence that the emergence of binary encoding dramatically enhances robustness.
arXiv Detail & Related papers (2023-10-12T11:16:57Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Hierarchical Sketch Induction for Paraphrase Generation [79.87892048285819]
We introduce Hierarchical Refinement Quantized Variational Autoencoders (HRQ-VAE), a method for learning decompositions of dense encodings.
We use HRQ-VAE to encode the syntactic form of an input sentence as a path through the hierarchy, allowing us to more easily predict syntactic sketches at test time.
arXiv Detail & Related papers (2022-03-07T15:28:36Z)
- Nearest neighbor search with compact codes: A decoder perspective [77.60612610421101]
We re-interpret popular methods such as binary hashing and product quantizers as auto-encoders.
We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes.
arXiv Detail & Related papers (2021-12-17T15:22:28Z)
- Sparse Coding with Multi-Layer Decoders using Variance Regularization [19.8572592390623]
We propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder.
Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold.
We show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations.
arXiv Detail & Related papers (2021-12-16T21:46:23Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings [77.6701264226519]
We introduce byteSteady, a fast model for classification using byte-level n-gram embeddings.
A straightforward application of byteSteady is text classification.
We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification.
arXiv Detail & Related papers (2021-06-24T20:14:48Z)
- Acoustic Neighbor Embeddings [2.842794675894731]
This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings.
The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences.
The recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
arXiv Detail & Related papers (2020-07-20T05:33:07Z)
- Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR).
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.