G-IFT: A Gated Linear Unit adapter with Iterative Fine-Tuning for Low-Resource Children's Speaker Verification
- URL: http://arxiv.org/abs/2508.07836v1
- Date: Mon, 11 Aug 2025 10:41:56 GMT
- Title: G-IFT: A Gated Linear Unit adapter with Iterative Fine-Tuning for Low-Resource Children's Speaker Verification
- Authors: Vishwas M. Shetty, Jiusi Zheng, Abeer Alwan
- Abstract summary: We propose a Gated Linear Unit adapter with Iterative Fine-Tuning (G-IFT) to enhance knowledge transfer efficiency between the high-resource adult speech domain and the low-resource children's speech domain.
- Score: 18.19235178193197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speaker Verification (SV) systems trained on adult speech often underperform on children's SV due to the acoustic mismatch, and the limited amount of children's speech data limits the effectiveness of fine-tuning. In this paper, we propose an innovative framework, a Gated Linear Unit adapter with Iterative Fine-Tuning (G-IFT), to enhance knowledge transfer efficiency between the high-resource adult speech domain and the low-resource children's speech domain. In this framework, a Gated Linear Unit adapter is first inserted between the pre-trained speaker embedding model and the classifier. Then the classifier, adapter, and pre-trained speaker embedding model are optimized sequentially in an iterative way. This framework is agnostic to the type of the underlying architecture of the SV system. Our experiments on ECAPA-TDNN, ResNet, and X-vector architectures using the OGI and MyST datasets demonstrate that the G-IFT framework yields consistent reductions in Equal Error Rates compared to baseline methods.
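The two ideas in the abstract, a gated adapter placed between the embedding model and the classifier, and a sequential optimization order, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the common GLU form `(W x + b) * sigmoid(V x + c)`, the choice to freeze the other modules during each stage, and the number of rounds are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(x, weight, bias):
    """Affine map; `weight` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def glu_adapter(x, w_value, b_value, w_gate, b_gate):
    """Gated Linear Unit applied to a speaker embedding `x`.

    Assumed form: (W x + b) * sigmoid(V x + c), element-wise.
    The sigmoid gate decides how much of each transformed
    dimension is passed on to the classifier.
    """
    value = linear(x, w_value, b_value)
    gate = linear(x, w_gate, b_gate)
    return [v * sigmoid(g) for v, g in zip(value, gate)]


def iterative_schedule(num_rounds,
                       order=("classifier", "adapter", "embedding_model")):
    """List (round, trainable_module) stages in the order given in the abstract.

    Assumption: only the named module is updated in a stage and the
    other two stay frozen; `num_rounds` is a hypothetical parameter.
    """
    return [(r, name) for r in range(num_rounds) for name in order]
```

With identity value weights and an all-zero gate, every gate value is sigmoid(0) = 0.5, so `glu_adapter([2.0, -4.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])` returns `[1.0, -2.0]`. Calling `iterative_schedule(2)` lists six stages, cycling through classifier, adapter, and embedding model twice. In a real system each stage would run a training loop over the children's speech data with only that module's parameters unfrozen.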
Related papers
- Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation [90.5844979560448]
VocAlign is a source-free domain adaptation framework specifically designed for VLMs in semantic segmentation.
Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks.
arXiv Detail & Related papers (2025-09-18T17:59:58Z)
- SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR [65.90944188787786]
Low-rank adaptation (LoRA) is widely used in speech applications, but its state-of-the-art variants, e.g., VeRA, DoRA, PiSSA, and SVFT, are developed mainly for language and vision tasks, with limited validation in speech.
This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet.
We evaluate all methods on domain-shifted speech recognition tasks, including child speech and dialectal variation, across model scales from 0.1B to 2B.
arXiv Detail & Related papers (2025-09-02T20:51:17Z)
- DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image.
Vision-Language Pre-training models offer a strong open-vocabulary foundation, but struggle with fine-grained localization under weak supervision.
We propose the Dual Adaptive Refinement Transfer (DART) framework to overcome these limitations.
arXiv Detail & Related papers (2025-08-07T17:22:33Z)
- The OCON model: an old but green solution for distributable supervised classification for acoustic monitoring in smart cities [0.28675177318965045]
This paper focuses on vowel phoneme classification and speaker recognition for the Automatic Speech Recognition domain.
For our case-study, the ASR model runs on a proprietary sensing and lightning system, exploited to monitor acoustic and air pollution on urban streets.
We formalize combinations of pseudo-Neural Architecture Search and Hyperparameter Tuning experiments, using an informed grid-search methodology, to achieve classification accuracy comparable to today's most complex architectures.
arXiv Detail & Related papers (2024-10-05T09:47:54Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Efficient acoustic feature transformation in mismatched environments using a Guided-GAN [1.495380389108477]
We propose a new framework to improve automatic speech recognition systems in resource-scarce environments.
We use a generative adversarial network (GAN) operating on acoustic input features to enhance the features of mismatched data.
With less than one hour of data, an ASR system trained on good quality data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER).
arXiv Detail & Related papers (2022-10-03T05:33:28Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- A Mixture of Expert Based Deep Neural Network for Improved ASR [4.993304210475779]
MixNet is a novel deep learning architecture for acoustic modeling in the context of Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification.
Experiments on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate.
arXiv Detail & Related papers (2021-12-02T07:26:34Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification [49.0621360050418]
We propose a domain adaptation framework to address the device mismatch issue in acoustic scene classification.
Taking into account the structural relationships between acoustic scene classes, our proposed framework captures such relationships which are intrinsically device-independent.
In the training stage, transferable knowledge is condensed in NLE from the source domain.
In the adaptation stage, a novel RTSL strategy is adopted to learn adapted target models without using paired source-target data.
arXiv Detail & Related papers (2020-07-31T23:07:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.