Learning Decoupling Features Through Orthogonality Regularization
- URL: http://arxiv.org/abs/2203.16772v1
- Date: Thu, 31 Mar 2022 03:18:13 GMT
- Title: Learning Decoupling Features Through Orthogonality Regularization
- Authors: Li Wang, Rongzhi Gu, Weiji Zhuang, Peng Gao, Yujun Wang, Yuexian Zou
- Abstract summary: Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously.
- Score: 55.79910376189138
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Keyword spotting (KWS) and speaker verification (SV) are two important tasks
in speech applications. Research shows that state-of-the-art KWS and SV models
are trained independently on different datasets, since each is expected to learn
distinctive acoustic features. However, humans can distinguish language content
and the speaker identity simultaneously. Motivated by this, we believe it is
important to explore a method that can effectively extract common features
while decoupling task-specific features. Bearing this in mind, a two-branch
deep network (KWS branch and SV branch) with the same network structure is
developed, and a novel decoupling feature learning method is proposed to improve
the performance of KWS and SV simultaneously: the KWS branch is expected to learn
speaker-invariant keyword representations and the SV branch keyword-invariant
speaker representations. Experiments are conducted on the Google Speech Commands Dataset
(GSCD). The results demonstrate that the orthogonality regularization helps the
network to achieve SOTA EER of 1.31% and 1.87% on KWS and SV, respectively.
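To make the decoupling idea concrete, here is a minimal PyTorch sketch of a two-branch network with an orthogonality penalty between the per-utterance KWS and SV embeddings. The layer choices, dimensions, and the exact form of the penalty are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed PyTorch implementation, NOT the authors' code) of a
# two-branch network whose keyword and speaker embeddings are regularized to
# be orthogonal on top of the usual per-task classification losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchNet(nn.Module):
    def __init__(self, feat_dim=40, emb_dim=128, n_keywords=12, n_speakers=1000):
        super().__init__()
        self.shared = nn.GRU(feat_dim, emb_dim, batch_first=True)  # shared front-end
        self.kws_branch = nn.Linear(emb_dim, emb_dim)              # task-specific heads
        self.sv_branch = nn.Linear(emb_dim, emb_dim)
        self.kws_cls = nn.Linear(emb_dim, n_keywords)
        self.sv_cls = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                        # x: (batch, time, feat_dim) acoustic features
        h, _ = self.shared(x)
        h = h.mean(dim=1)                        # simple temporal average pooling
        e_kws = F.normalize(self.kws_branch(h), dim=-1)
        e_sv = F.normalize(self.sv_branch(h), dim=-1)
        return e_kws, e_sv, self.kws_cls(e_kws), self.sv_cls(e_sv)

def orthogonality_loss(e_kws, e_sv):
    # Penalize the squared inner product between the two unit-norm embeddings
    # of each utterance, pushing the keyword and speaker subspaces apart.
    return ((e_kws * e_sv).sum(dim=-1) ** 2).mean()

def total_loss(model, x, kw_labels, spk_labels, lam=0.1):
    e_kws, e_sv, kw_logits, spk_logits = model(x)
    return (F.cross_entropy(kw_logits, kw_labels)
            + F.cross_entropy(spk_logits, spk_labels)
            + lam * orthogonality_loss(e_kws, e_sv))
```

With both embeddings L2-normalized, the penalty reduces to the squared cosine similarity, so it vanishes exactly when an utterance's keyword and speaker representations are orthogonal.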
Related papers
- WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter.
We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z) - Simultaneous or Sequential Training? How Speech Representations
Cooperate in a Multi-Task Self-Supervised Learning System [12.704529528199064]
Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning.
We study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task learning system.
arXiv Detail & Related papers (2023-06-05T15:35:19Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker
Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - Multi-task Voice-Activated Framework using Self-supervised Learning [0.9864260997723973]
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data.
We propose a general purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks.
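A rough, hedged sketch of such adaptation using the Hugging Face transformers API is shown below; the paper's actual framework, task heads, and fine-tuning schedule may differ, and the checkpoint name is only an example.

```python
# Sketch only: adapt a pre-trained wav2vec 2.0 encoder to a voice-activated
# classification task (e.g. keyword spotting) with a lightweight task head.
# Checkpoint name and head design are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class VoiceTaskModel(nn.Module):
    def __init__(self, n_classes, checkpoint="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, n_classes)

    def forward(self, input_values):             # raw 16 kHz waveform: (batch, samples)
        hidden = self.encoder(input_values).last_hidden_state
        pooled = hidden.mean(dim=1)               # mean-pool over time frames
        return self.classifier(pooled)

# The same encoder can be wrapped with different heads (KWS, SV, intent, ...),
# optionally freezing the encoder and training only the head.
model = VoiceTaskModel(n_classes=12)
logits = model(torch.randn(2, 16000))             # two one-second dummy clips
```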
arXiv Detail & Related papers (2021-10-03T19:28:57Z) - Multi-task Learning with Cross Attention for Keyword Spotting [8.103605110339519]
Keyword spotting (KWS) is an important technique for speech applications, which enables users to activate devices by speaking a keyword phrase.
There is a mismatch between the training criterion (phoneme recognition) and the target task (KWS).
Recently, multi-task learning has been applied to KWS to exploit both ASR and KWS training data.
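As a rough, hypothetical sketch (not the paper's exact architecture), cross attention can be implemented by letting a learned keyword query attend over the frame-level output of an acoustic encoder shared with the phoneme-recognition branch:

```python
# Illustrative sketch of cross attention for KWS over a shared acoustic
# encoder output; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class CrossAttentionKWSHead(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_keywords=12):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # learned keyword query
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cls = nn.Linear(d_model, n_keywords)

    def forward(self, enc_out):                  # enc_out: (batch, time, d_model)
        q = self.query.expand(enc_out.size(0), -1, -1)
        pooled, _ = self.attn(q, enc_out, enc_out)  # attend over encoder frames
        return self.cls(pooled.squeeze(1))
```

The same encoder output can simultaneously feed a phoneme-recognition (ASR) head, which is how a single network can exploit both ASR and KWS training data.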
arXiv Detail & Related papers (2021-07-15T22:38:16Z) - Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on
Spoken Language Understanding [101.24748444126982]
Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Existing benchmarks, however, typically hold out examples for only the surface-level sub-task.
We propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions.
arXiv Detail & Related papers (2021-06-29T02:53:59Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z) - Multi-Task Network for Noise-Robust Keyword Spotting and Speaker
Verification using CTC-based Soft VAD and Global Query Attention [13.883985850789443]
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, but the acoustic and speaker domains are complementary.
We propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information.
arXiv Detail & Related papers (2020-05-08T05:58:46Z)