A Hierarchical Model for Spoken Language Recognition
- URL: http://arxiv.org/abs/2201.01364v1
- Date: Tue, 4 Jan 2022 22:10:36 GMT
- Title: A Hierarchical Model for Spoken Language Recognition
- Authors: Luciana Ferrer, Diego Castan, Mitchell McLaren, Aaron Lawson
- Abstract summary: Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample.
We propose a novel hierarchical approach where two PLDA models are trained, one to generate scores for clusters of highly related languages and a second one to generate scores conditional to each cluster.
We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages.
- Score: 29.948719321162883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language recognition (SLR) refers to the automatic process used to
determine the language present in a speech sample. SLR is an important task in
its own right, for example, as a tool to analyze or categorize large amounts of
multi-lingual data. Further, it is also an essential tool for selecting
downstream applications in a workflow, for example, to choose appropriate
speech recognition or machine translation models. SLR systems are usually
composed of two stages, one where an embedding representing the audio sample is
extracted and a second one which computes the final scores for each language.
In this work, we approach the SLR task as a detection problem and implement the
second stage as a probabilistic linear discriminant analysis (PLDA) model. We
show that discriminative training of the PLDA parameters gives large gains with
respect to the usual generative training. Further, we propose a novel
hierarchical approach where two PLDA models are trained, one to generate scores
for clusters of highly related languages and a second one to generate scores
conditional to each cluster. The final language detection scores are computed
as a combination of these two sets of scores. The complete model is trained
discriminatively to optimize a cross-entropy objective. We show that this
hierarchical approach consistently outperforms the non-hierarchical one for
detection of highly related languages, in many cases by large margins. We train
our systems on a collection of datasets including 100 languages and test them
both on matched and mismatched conditions, showing that the gains are robust to
condition mismatch.
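The score combination described in the abstract can be sketched as follows. This is a minimal illustration under assumed details, not the authors' implementation: it treats the raw cluster-level and cluster-conditional scores as unnormalized log-probabilities and combines them via the chain rule, log p(lang) = log p(cluster(lang)) + log p(lang | cluster(lang)). The function names and the softmax calibration are assumptions.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def hierarchical_scores(cluster_scores, cond_scores, cluster_of):
    """Combine cluster-level and cluster-conditional scores into
    per-language log-posteriors.

    cluster_scores: (num_clusters,) raw scores, one per language cluster
    cond_scores:    (num_langs,)    raw scores, one per language, computed
                                    conditionally within its cluster
    cluster_of:     (num_langs,)    index of the cluster each language belongs to
    """
    log_p_cluster = log_softmax(cluster_scores)
    # Normalize the conditional scores within each cluster so they form
    # a distribution over that cluster's languages.
    log_p_cond = np.empty_like(cond_scores, dtype=float)
    for c in range(len(cluster_scores)):
        mask = cluster_of == c
        log_p_cond[mask] = log_softmax(cond_scores[mask])
    # Chain rule: log p(lang) = log p(cluster(lang)) + log p(lang | cluster(lang))
    return log_p_cluster[cluster_of] + log_p_cond

# Toy example: 4 languages grouped into 2 clusters of highly related languages.
cluster_of = np.array([0, 0, 1, 1])
scores = hierarchical_scores(np.array([2.0, 0.0]),
                             np.array([1.0, -1.0, 0.5, 0.5]),
                             cluster_of)
print(np.exp(scores).sum())  # resulting posteriors sum to 1
```

In the paper, both sets of scores come from PLDA models and the whole pipeline is trained discriminatively against a cross-entropy objective; the sketch above only shows how the two score levels compose into final language detection scores.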
Related papers
- Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z) - Multilingual Few-Shot Learning via Language Model Retrieval [18.465566186549072]
Transformer-based language models have achieved remarkable success in few-shot in-context learning.
We conduct a study of retrieving semantically similar few-shot samples and using them as the context.
We evaluate the proposed method on five natural language understanding datasets related to intent detection, question classification, sentiment analysis, and topic classification.
arXiv Detail & Related papers (2023-06-19T14:27:21Z) - Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z) - SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding
Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading
Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets at recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented
Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intents and slots representations separately and cast both prediction tasks into sequence labeling problems.
arXiv Detail & Related papers (2021-06-07T16:40:05Z) - Cross-lingual Spoken Language Understanding with Regularized
Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z) - Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embedding different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models with appropriate training settings can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z) - Analysis of Predictive Coding Models for Phonemic Representation
Learning in Small Datasets [0.0]
The present study investigates the behaviour of two predictive coding models, Autoregressive Predictive Coding and Contrastive Predictive Coding, in a phoneme discrimination task.
Our experiments show a strong correlation between the autoregressive loss and the phoneme discrimination scores with the two datasets.
The CPC model shows rapid convergence already after one pass over the training data, and, on average, its representations outperform those of APC on both languages.
arXiv Detail & Related papers (2020-07-08T15:46:13Z) - Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance in language recognition, where parts of speech data are weighted more based on their relevance for the language recognition task.
Experiments are performed using the language recognition task in NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data.
arXiv Detail & Related papers (2020-04-02T18:31:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.