Related papers: Transducer-based language embedding for spoken language identification

Transducer-based language embedding for spoken language identification

URL: http://arxiv.org/abs/2204.03888v1
Date: Fri, 8 Apr 2022 07:23:43 GMT
Title: Transducer-based language embedding for spoken language identification
Authors: Peng Shen, Xugang Lu, Hisashi Kawai
Abstract summary: The acoustic and linguistic features are important cues for the spoken language identification task. Recent advanced LID systems mainly use acoustic features that lack the usage of explicit linguistic feature encoding. We propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework.
Score: 38.60303603000269
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The acoustic and linguistic features are important cues for the spoken language identification (LID) task. Recent advanced LID systems mainly use acoustic features that lack the usage of explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework. Benefiting from the advantages of the RNN transducer's linguistic representation capability, the proposed method can exploit both phonetically-aware acoustic features and explicit linguistic features for LID tasks. Experiments were carried out on the large-scale multilingual LibriSpeech and VoxLingua107 datasets. Experimental results showed the proposed method significantly improves the performance on LID tasks with 12% to 59% and 16% to 24% relative improvement on in-domain and cross-domain datasets, respectively.

Related papers

Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively. However, Whisper struggles with unseen languages, those not included in its pre-training. We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9 [4.328586290529485]
We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset.
arXiv Detail & Related papers (2024-06-17T06:19:14Z)
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
Generative linguistic representation for spoken language identification [17.9575874225144]
We explore the utilization of the decoder-based network from the Whisper model to extract linguistic features. We devised two strategies - one based on the language embedding method and the other focusing on direct optimization of LID outputs. We conducted experiments on the large-scale multilingual datasets MLS, VoxLingua107, and CommonVoice to test our approach.
arXiv Detail & Related papers (2023-12-18T06:40:24Z)
Semantic enrichment towards efficient speech representations [9.30840529284715]
This study investigates a specific in-domain semantic enrichment of the SAMU-XLSR model. We show the benefits of the use of same-domain French and Italian benchmarks for low-resource language portability.
arXiv Detail & Related papers (2023-07-03T19:52:56Z)
Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task. This framework, termed as Label Aware Speech Representation (LASR) learning, uses a triplet based objective function to incorporate language labels along with the self-supervised loss function.
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information. Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages in frame-level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
Is Attention always needed? A Case Study on Language Identification from Speech [1.162918464251504]
The present study introduces convolutional recurrent neural network (CRNN) based LID. CRNN based LID is designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) characteristics of audio samples. The LID model exhibits high-performance levels ranging from 97% to 100% for languages that are linguistically similar.
arXiv Detail & Related papers (2021-10-05T16:38:57Z)
Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the problem. We use the language identities to bias the model to predict the CS points. This promotes the model to learn the language identity information directly from transcription, and no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.