X-SepFormer: End-to-end Speaker Extraction Network with Explicit
Optimization on Speaker Confusion
- URL: http://arxiv.org/abs/2303.05023v1
- Date: Thu, 9 Mar 2023 04:00:29 GMT
- Title: X-SepFormer: End-to-end Speaker Extraction Network with Explicit
Optimization on Speaker Confusion
- Authors: Kai Liu, Ziqing Du, Xucheng Wan, Huan Zhou
- Abstract summary: We present an end-to-end TSE model with proposed loss schemes and a backbone of SepFormer.
With an SI-SDRi of 19.4 dB and a PESQ of 3.81, our best system significantly outperforms the current SOTA systems.
- Score: 5.4878772986187565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Target speech extraction (TSE) systems are designed to extract
target speech from a multi-talker mixture. The popular training objective for
most prior TSE networks is to enhance the reconstruction performance of the
extracted speech waveform. However, it has been reported that a TSE system
that delivers high reconstruction performance may still suffer from
low-quality experience problems in practice. One such problem is wrong
speaker extraction (called speaker confusion, SC), which leads to a strongly
negative experience and hampers effective conversations. To mitigate the
pressing SC issue, we reformulate the training objective and propose two
novel loss schemes that explore a metric of reconstruction improvement
defined at the small-chunk level and leverage the distribution information
associated with that metric. Both loss schemes aim to encourage a TSE network
to pay attention to SC chunks based on this distribution information. On this
basis, we present X-SepFormer, an end-to-end TSE model with the proposed loss
schemes and a SepFormer backbone. Experimental results on the benchmark
WSJ0-2mix dataset validate the effectiveness of our proposals, showing
consistent improvements on SC errors (by 14.8% relative). Moreover, with an
SI-SDRi of 19.4 dB and a PESQ of 3.81, our best system significantly
outperforms current SOTA systems and offers the best TSE results reported to
date on WSJ0-2mix.
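The chunk-level idea can be illustrated with a minimal sketch: compute the
SI-SDR improvement of the extracted signal over the mixture on small chunks,
then re-weight the loss so that chunks with the least improvement (likely SC
chunks) dominate the gradient. The chunk size, the softmax re-weighting, and
all function names below are illustrative assumptions, not the paper's exact
loss formulation.

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB, computed over the last (time) dimension."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    scale = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = scale * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def sc_weighted_loss(est, ref, mix, chunk: int = 4000):
    """Chunk-level SI-SDR improvement, re-weighted toward low-improvement chunks."""
    n = (est.shape[-1] // chunk) * chunk
    split = lambda x: x[..., :n].reshape(x.shape[0], -1, chunk)  # (B, n_chunks, chunk)
    sdri = si_sdr(split(est), split(ref)) - si_sdr(split(mix), split(ref))
    weights = torch.softmax(-sdri.detach(), dim=-1)  # low SI-SDRi -> high weight
    return -(weights * sdri).sum(dim=-1).mean()
```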
Related papers
- Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization [34.51491788470738]
We propose reverse inference optimization (RIO) to enhance the robustness of autoregressive-model-based text-to-speech (TTS) systems.
RIO uses reverse inference as the criterion to select exemplars for RLHF from speech samples generated by the TTS system itself (see the sketch after this entry).
RIO significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions.
arXiv Detail & Related papers (2024-07-02T13:04:04Z)
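A heavily hedged sketch of the exemplar-selection step only: rank self-generated samples by a reverse-inference score and keep the best and worst as a preference pair. The `reverse_score` callable is a placeholder, not RIO's actual criterion, which is defined in the paper.

```python
from typing import Callable, List, Tuple

def select_preference_pair(
    samples: List[str],                      # self-generated TTS outputs
    reverse_score: Callable[[str], float],   # placeholder reverse-inference scorer
) -> Tuple[str, str]:
    """Return (preferred, rejected) exemplars for RLHF-style optimization."""
    ranked = sorted(samples, key=reverse_score, reverse=True)
    return ranked[0], ranked[-1]
```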
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens the representation of each modality by fusing them at different levels of the audio/visual encoders (see the sketch after this entry).
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
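A minimal sketch of one cross-attention fusion block, assuming equal audio/visual feature dimensions; module names and sizes are assumptions, not the MLCA-AVSR reference code. Multi-layer fusion would place one such block after each pair of encoder layers rather than only at the top.

```python
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Each modality attends to the other, with residual add and LayerNorm."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # Audio queries attend over video keys/values, and vice versa.
        a, _ = self.a2v(audio, video, video)
        v, _ = self.v2a(video, audio, audio)
        return self.norm_a(audio + a), self.norm_v(video + v)
```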
- Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) method to develop low-resource accent adaptation for text-to-speech (TTS).
A resource-efficient adaptation from a frozen pre-trained TTS model is developed using only 1.2% to 0.8% of the original trainable parameters (see the adapter-style sketch after this entry).
Experimental results show that the proposed methods achieve competitive naturalness with parameter-efficient decoder fine-tuning.
arXiv Detail & Related papers (2023-05-18T22:02:59Z)
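One common way to realize this kind of parameter-efficient adaptation is bottleneck adapters on a frozen backbone; the sketch below assumes that design and a decoder exposing a `.layers` list, which may differ from the paper's exact PEL configuration.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def add_adapters(decoder: nn.Module, dim: int) -> nn.ModuleList:
    """Freeze every pretrained parameter and route each layer's output
    through a small trainable adapter via a forward hook."""
    for p in decoder.parameters():
        p.requires_grad = False
    adapters = nn.ModuleList()
    for layer in decoder.layers:  # assumes the decoder exposes .layers
        adapter = Adapter(dim)
        # Returning a value from a forward hook replaces the layer output;
        # this assumes each layer returns a single tensor.
        layer.register_forward_hook(lambda mod, inp, out, a=adapter: a(out))
        adapters.append(adapter)
    return adapters  # only these (roughly 1% of parameters) are optimized
```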
- TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization [54.41494515178297]
We reformulate speaker diarization as a single-label classification problem (see the power-set sketch after this entry).
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependencies can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in diarization error rate.
arXiv Detail & Related papers (2023-03-08T05:05:26Z)
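The single-label reformulation can be illustrated with a generic power-set encoding, which maps each frame's multi-speaker activity vector to one class index out of the 2^S speaker subsets; this is a sketch of the label mapping only, not TOLD's full two-stage framework.

```python
import torch

def powerset_encode(activity: torch.Tensor) -> torch.Tensor:
    """(frames, n_speakers) binary activity -> (frames,) class index in [0, 2**S)."""
    weights = 2 ** torch.arange(activity.shape[-1])  # one bit per speaker
    return (activity.long() * weights).sum(-1)

def powerset_decode(labels: torch.Tensor, n_spk: int) -> torch.Tensor:
    """Inverse mapping: class index -> binary activity matrix."""
    return (labels.unsqueeze(-1) >> torch.arange(n_spk)) & 1

# Two speakers: frame 2 is overlapped speech, frame 3 is silence.
frames = torch.tensor([[1, 0], [1, 1], [0, 0]])
print(powerset_encode(frames))  # tensor([1, 3, 0])
```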
- Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings [5.4878772986187565]
We propose a simplified speaker cue with clear class separability for target speaker extraction.
Our proposal shows up to 9.9% relative improvement in SI-SDRi.
With an SI-SDRi of 19.4 dB and a PESQ of 3.78, our best TSE system significantly outperforms the current SOTA systems.
arXiv Detail & Related papers (2023-01-16T06:30:48Z)
- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge.
We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data (see the sketch after this entry).
Our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z)
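A minimal sketch of SOT label construction: transcripts of overlapping speakers are concatenated in first-in-first-out order, separated by a speaker-change token. The token name is assumed for illustration.

```python
SC_TOKEN = "<sc>"  # assumed speaker-change token name

def serialize(utterances: list[tuple[float, str]]) -> str:
    """(start_time, transcript) pairs for one mixture -> single SOT target."""
    ordered = sorted(utterances, key=lambda u: u[0])  # first-in-first-out
    return f" {SC_TOKEN} ".join(text for _, text in ordered)

# Two overlapping speakers become one serialized training target.
print(serialize([(1.2, "how are you"), (0.3, "hello there")]))
# hello there <sc> how are you
```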
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of the STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems based on deep neural networks used as feature extractors.
For the video modality, our best solution uses the RetinaFace face detector and a deep ResNet face-embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer with an improved beam search, comes within 3.8% absolute WER of the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
- LEAP System for SRE19 CTS Challenge -- Improvements and Error Analysis [36.35711634925221]
We provide a detailed account of the LEAP SRE system submitted to the CTS challenge.
All the systems used time-delay neural network (TDNN) based x-vector embeddings.
The system combination of generative and neural PLDA models resulted in significant improvements for the SRE evaluation dataset.
arXiv Detail & Related papers (2020-02-07T12:28:56Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU), as sketched below.
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
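A minimal sketch of the joint-modeling idea, assuming a shared encoder with a token-level correction head and an utterance-level LU head; sizes, architecture, and the loss weight are illustrative assumptions, not the paper's exact design.

```python
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    """Shared encoder over ASR hypothesis tokens with two task heads."""
    def __init__(self, vocab: int = 10000, dim: int = 256, n_intents: int = 20):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.correct = nn.Linear(dim, vocab)      # per-token corrected word
        self.intent = nn.Linear(dim, n_intents)   # utterance-level LU label

    def forward(self, asr_tokens):
        h, _ = self.encoder(self.embed(asr_tokens))
        return self.correct(h), self.intent(h.mean(dim=1))

def joint_loss(logits_corr, logits_intent, ref_tokens, intent, w: float = 0.5):
    """Weighted sum of the correction and LU cross-entropy losses."""
    ce = nn.functional.cross_entropy
    return ce(logits_corr.transpose(1, 2), ref_tokens) + w * ce(logits_intent, intent)
```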
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.