Generative linguistic representation for spoken language identification
- URL: http://arxiv.org/abs/2312.10964v1
- Date: Mon, 18 Dec 2023 06:40:24 GMT
- Title: Generative linguistic representation for spoken language identification
- Authors: Peng Shen, Xugang Lu, Hisashi Kawai
- Abstract summary: We explore the utilization of the decoder-based network from the Whisper model to extract linguistic features.
We devised two strategies - one based on the language embedding method and the other focusing on direct optimization of LID outputs.
We conducted experiments on the large-scale multilingual datasets MLS, VoxLingua107, and CommonVoice to test our approach.
- Score: 17.9575874225144
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Effective extraction and application of linguistic features are central to
the enhancement of spoken Language IDentification (LID) performance. With the
success of recent large models, such as GPT and Whisper, the potential to
leverage such pre-trained models for extracting linguistic features for LID
tasks has become a promising area of research. In this paper, we explore the
utilization of the decoder-based network from the Whisper model to extract
linguistic features through its generative mechanism for improving the
classification accuracy in LID tasks. We devised two strategies - one based on
the language embedding method and the other focusing on direct optimization of
LID outputs while simultaneously enhancing the speech recognition tasks. We
conducted experiments on the large-scale multilingual datasets MLS,
VoxLingua107, and CommonVoice to test our approach. The experimental results
demonstrated the effectiveness of the proposed method on both in-domain and
out-of-domain datasets for LID tasks.
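As a rough illustration of the general recipe the abstract describes (not the authors' exact implementation), hidden states from a pre-trained decoder can be mean-pooled into a fixed-size linguistic feature and passed to a linear LID head. The sketch below uses NumPy with random stand-ins for the Whisper decoder outputs; the dimensions and the 8-language head are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for decoder hidden states: (time_steps, hidden_dim).
# In practice these would come from the pre-trained decoder's generative pass.
decoder_states = rng.normal(size=(40, 512))

# Mean-pool over time to obtain a fixed-size linguistic feature vector.
feature = decoder_states.mean(axis=0)            # shape (512,)

# Hypothetical linear LID head over 8 candidate languages.
num_languages = 8
W = rng.normal(scale=0.01, size=(512, num_languages))
b = np.zeros(num_languages)

logits = feature @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over languages

predicted_language = int(np.argmax(probs))
print(predicted_language)
```

In the paper's setting the classifier would be trained jointly with (or on top of) the decoder rather than applied to random features; this sketch only shows the pooling-plus-classification shape of the pipeline.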
Related papers
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z) - Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model [12.030995417911296]
This study proposes Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups.
Within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language.
Our method preserves the efficient inference capabilities characteristic of MoE models without necessitating additional pre-training.
arXiv Detail & Related papers (2024-09-03T16:53:38Z) - OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework that combines self-supervised representation learning with language label information during pre-training.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective function to incorporate language labels alongside the self-supervised loss.
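A triplet objective of the kind described can be sketched as below; this is a generic margin-based triplet loss keyed on language labels (the exact LASR formulation may differ), with toy two-dimensional embeddings.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: pull same-language embeddings
    together, push different-language embeddings apart."""
    d_pos = np.sum((anchor - positive) ** 2)   # same-language distance
    d_neg = np.sum((anchor - negative) ** 2)   # cross-language distance
    return max(0.0, d_pos - d_neg + margin)

# Toy speech embeddings: anchor and positive share a language label,
# negative comes from a different language.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.2])

print(triplet_loss(anchor, positive, negative))  # → 0.0 (triplet already satisfied)
```

In a real system this term would be added to the self-supervised loss and the embeddings would come from the speech encoder, not fixed vectors.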
arXiv Detail & Related papers (2023-06-07T12:14:16Z) - Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
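The robustness claim rests on the difference between a softmax (multiclass) head and per-language sigmoid (multilabel) heads: a softmax is forced to assign all probability mass to the known languages, while independent sigmoids can all stay below a rejection threshold for an unseen language. A minimal sketch with made-up logits (the threshold of 0.5 is an assumption):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical logits for 4 target languages on an out-of-set utterance:
# all heads are unconfident.
logits = np.array([-2.0, -1.5, -1.8, -2.2])

# Multiclass: probabilities must sum to 1, so one target language
# always "wins" even when none actually fits.
print(round(float(softmax(logits).sum()), 6))   # → 1.0

# Multilabel: each language is scored independently; every score can
# fall below a rejection threshold, flagging a non-target language.
scores = sigmoid(logits)
print(bool((scores < 0.5).all()))               # → True
```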
arXiv Detail & Related papers (2023-06-02T23:04:19Z) - Adaptive Activation Network For Low Resource Multilingual Speech Recognition [30.460501537763736]
We introduce an adaptive activation network to the upper layers of ASR model.
We also propose two approaches to training the model: (1) cross-lingual learning, which replaces the activation functions of the source language with those of the target language, and (2) multilingual learning.
Our experiments on the IARPA Babel datasets demonstrate that our approaches outperform from-scratch training and traditional bottleneck-feature-based methods.
arXiv Detail & Related papers (2022-05-28T04:02:59Z) - Transducer-based language embedding for spoken language identification [38.60303603000269]
Acoustic and linguistic features are both important cues for the spoken language identification task.
Recent advanced LID systems rely mainly on acoustic features and lack explicit linguistic feature encoding.
We propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework.
arXiv Detail & Related papers (2022-04-08T07:23:43Z) - Multilingual Speech Recognition using Knowledge Transfer across Learning Processes [15.927513451432946]
Experimental results reveal the best pre-training strategy resulting in 3.55% relative reduction in overall WER.
A combination of LEAP and SSL yields 3.51% relative reduction in overall WER when using language ID.
arXiv Detail & Related papers (2021-10-15T07:50:27Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance in language recognition, where parts of the input speech are weighted more heavily based on their relevance to the language recognition task.
Experiments are performed on the language recognition task of the NIST LRE 2017 Challenge using clean, noisy, and multi-speaker speech data.
arXiv Detail & Related papers (2020-04-02T18:31:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.