Subspace-based Representation and Learning for Phonotactic Spoken
Language Recognition
- URL: http://arxiv.org/abs/2203.15576v1
- Date: Mon, 28 Mar 2022 07:01:45 GMT
- Title: Subspace-based Representation and Learning for Phonotactic Spoken
Language Recognition
- Authors: Hung-Shin Lee, Yu Tsao, Shyh-Kang Jeng, Hsin-Min Wang
- Abstract summary: We propose a new learning mechanism based on subspace-based representation.
It can extract concealed phonotactic structures from utterances for language verification and dialect/accent identification.
The proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively.
- Score: 27.268047798971473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Phonotactic constraints can be employed to distinguish languages by
representing a speech utterance as a multinomial distribution or a sequence of phone events.
In the present study, we propose a new learning mechanism based on
subspace-based representation, which can extract concealed phonotactic
structures from utterances, for language verification and dialect/accent
identification. The framework mainly involves two successive parts. The first
part involves subspace construction. Specifically, it decodes each utterance
into a sequence of vectors filled with phone-posteriors and transforms the
vector sequence into a linear orthogonal subspace based on low-rank matrix
factorization or dynamic linear modeling. The second part involves subspace
learning based on kernel machines, such as support vector machines and the
newly developed subspace-based neural networks (SNNs). The input layer of SNNs
is specifically designed for samples represented by subspaces. The topology
ensures that identical subspaces yield the same output, since the conventional
feed-forward pass is modified to fit the mathematical definition of subspace
similarity. Evaluated on the "General LR" test of NIST LRE 2007,
the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions
in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC
methods and the lattice-based PPR-LM method, respectively. Furthermore, on the
dialect/accent identification task of NIST LRE 2009, the SNN-based system
performed better than the aforementioned four baseline methods.
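As a rough illustration of the two-part pipeline, the sketch below builds an orthonormal subspace basis from a phone-posterior matrix via truncated SVD and scores two utterances by the product of the cosines of their principal angles. The rank, the SVD-based factorization, and the product-of-cosines similarity are illustrative assumptions; the paper also considers dynamic linear modeling, and its SNN realizes such a similarity inside the network rather than as a standalone score.

```python
import numpy as np

def utterance_subspace(posteriors: np.ndarray, rank: int) -> np.ndarray:
    """Map a (T, D) sequence of phone-posterior vectors to a (D, rank)
    orthonormal basis via low-rank factorization (here, truncated SVD)."""
    U, _, _ = np.linalg.svd(posteriors.T, full_matrices=False)
    return U[:, :rank]

def subspace_similarity(U1: np.ndarray, U2: np.ndarray) -> float:
    """Similarity from principal angles: the singular values of U1^T U2
    are the cosines of the principal angles between the two subspaces;
    their product is one natural basis-invariant similarity."""
    cosines = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.prod(np.clip(cosines, 0.0, 1.0)))

# Toy usage: two utterances decoded into D=50 phone posteriors.
rng = np.random.default_rng(0)
A = utterance_subspace(rng.random((200, 50)), rank=5)
B = utterance_subspace(rng.random((180, 50)), rank=5)
print(subspace_similarity(A, B))
```

Because the singular values of U1^T U2 are unchanged when either basis is rotated within its own span, any two bases of the same subspace give the same score, which is the invariance the SNN topology is designed to preserve.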
Related papers
- Subspace Representation Learning for Sparse Linear Arrays to Localize More Sources than Sensors: A Deep Learning Methodology [19.100476521802243]
We develop a novel methodology that estimates the co-array subspaces from a sample covariance matrix for sparse linear arrays (SLAs).
To learn such representations, we propose loss functions that gauge the separation between the desired and the estimated subspace.
The computation of learning subspaces of different dimensions is accelerated by a new batch sampling strategy.
arXiv Detail & Related papers (2024-08-29T15:14:52Z) - A Geometric Notion of Causal Probing [91.14470073637236]
In a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace.
We give a set of intrinsic criteria which characterize an ideal linear concept subspace.
We find that LEACE returns a one-dimensional subspace containing roughly half of total concept information.
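As a hedged sketch of what operating on such a concept subspace looks like (plain orthogonal projection, not the LEACE procedure itself), erasing the component of a representation that lies in a subspace with orthonormal basis V amounts to:

```python
import numpy as np

def erase_concept(x: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Remove the component of representation x (shape (d,)) lying in the
    concept subspace spanned by the orthonormal columns of V (shape (d, k))."""
    return x - V @ (V.T @ x)
```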
arXiv Detail & Related papers (2023-07-27T17:57:57Z) - SubspaceNet: Deep Learning-Aided Subspace Methods for DoA Estimation [36.647703652676626]
SubspaceNet is a data-driven DoA estimator which learns how to divide the observations into distinguishable subspaces.
SubspaceNet is shown to enable various DoA estimation algorithms to cope with coherent sources, wideband signals, low SNR, array mismatches, and limited snapshots.
arXiv Detail & Related papers (2023-06-04T06:30:13Z) - PROTOtypical Logic Tensor Networks (PROTO-LTN) for Zero Shot Learning [2.236663830879273]
Logic Tensor Networks (LTNs) are neuro-symbolic systems based on a differentiable, first-order logic grounded in a deep neural network.
We focus here on the subsumption, or isOfClass, predicate, which is fundamental to encoding most semantic image interpretation tasks.
We propose a common isOfClass predicate, whose degree of truth is a function of the distance between an object embedding and the corresponding class prototype.
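A minimal sketch of such a distance-based predicate, with a hypothetical Gaussian-style grounding (the paper's exact truth function and scale are not specified here):

```python
import numpy as np

def is_of_class(embedding: np.ndarray, prototype: np.ndarray,
                scale: float = 1.0) -> float:
    """Hypothetical isOfClass grounding: truth degree in (0, 1] that
    decays with squared distance to the class prototype."""
    return float(np.exp(-scale * np.sum((embedding - prototype) ** 2)))
```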
arXiv Detail & Related papers (2022-06-26T18:34:07Z) - Preliminary study on using vector quantization latent spaces for TTS/VC
systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent space during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
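A minimal sketch of the quantization step itself (codebook size and distance metric are illustrative assumptions; the training policies are not shown):

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray) -> tuple[np.ndarray, int]:
    """Replace latent z (shape (d,)) with its nearest codebook entry
    (codebook shape (K, d)), returning the code vector and its index."""
    idx = int(np.argmin(np.sum((codebook - z) ** 2, axis=1)))
    return codebook[idx], idx
```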
arXiv Detail & Related papers (2021-06-25T07:51:35Z) - Structured Reordering for Modeling Latent Alignments in Sequence
Transduction [86.94309120789396]
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z) - Introducing Orthogonal Constraint in Structural Probes [0.2538209532048867]
We decompose a linear projection of the language vector space into isomorphic space rotation and linear scaling directions.
We experimentally show that our approach can be performed in a multitask setting.
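One standard way to realize such a split, shown here as an assumed sketch via the polar decomposition rather than the paper's training procedure: any square map B factors into an orthogonal rotation times a symmetric scaling.

```python
import numpy as np

def rotation_scaling_split(B: np.ndarray):
    """Polar decomposition B = Q @ S: Q is the nearest orthogonal
    (rotation-like) matrix to B; S is a positive semidefinite scaling."""
    U, s, Vt = np.linalg.svd(B)
    Q = U @ Vt
    S = Vt.T @ np.diag(s) @ Vt
    return Q, S
```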
arXiv Detail & Related papers (2020-12-30T17:14:25Z) - Connecting Weighted Automata, Tensor Networks and Recurrent Neural
Networks through Spectral Learning [58.14930566993063]
We present connections between three models used in different research fields: weighted finite automata (WFA) from formal languages and linguistics, recurrent neural networks used in machine learning, and tensor networks.
We introduce the first provable learning algorithm for linear 2-RNNs defined over sequences of continuous input vectors.
arXiv Detail & Related papers (2020-10-19T15:28:00Z) - Nonlinear ISA with Auxiliary Variables for Learning Speech
Representations [51.9516685516144]
We introduce a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables.
We propose an algorithm that learns unsupervised speech representations whose subspaces are independent.
arXiv Detail & Related papers (2020-07-25T14:53:09Z) - Filtered Inner Product Projection for Crosslingual Embedding Alignment [28.72288652451881]
Filtered Inner Product Projection (FIPP) is a method for mapping embeddings to a common representation space.
FIPP is applicable even when the source and target embeddings are of differing dimensionalities.
We show that our approach outperforms existing methods on the MUSE dataset for various language pairs.
arXiv Detail & Related papers (2020-06-05T19:53:30Z) - Deep Metric Structured Learning For Facial Expression Recognition [58.7528672474537]
We propose a deep metric learning model to create embedded subspaces with a well-defined structure.
A new loss function that imposes Gaussian structures on the output space is introduced to create these subspaces.
We experimentally demonstrate that the learned embedding can be successfully used for various applications including expression retrieval and emotion recognition.
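A hypothetical sketch of a loss imposing such Gaussian structure (isotropic, with an assumed variance; the paper's actual loss may differ): the negative log-likelihood of an embedding under its class's Gaussian reduces, up to constants, to a scaled squared distance to the class mean.

```python
import numpy as np

def gaussian_structure_loss(embedding: np.ndarray, class_mean: np.ndarray,
                            sigma: float = 1.0) -> float:
    """NLL of embedding under an isotropic class Gaussian, up to constants."""
    return float(np.sum((embedding - class_mean) ** 2) / (2.0 * sigma ** 2))
```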
arXiv Detail & Related papers (2020-01-18T06:23:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.