GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model
- URL: http://arxiv.org/abs/2512.20978v1
- Date: Wed, 24 Dec 2025 06:13:02 GMT
- Title: GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model
- Authors: Haoyang Li, Xuyi Zhuang, Azmat Adnan, Ye Ni, Wei Rao, Shreyas Gopal, Eng Siong Chng
- Abstract summary: We present GenTSE, a two-stage decoder-only generative LM approach for TSE. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
- Score: 35.12859489567766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language Model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further employ DPO to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
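The abstract describes the pipeline but gives no implementation details. As a rough illustration of the coarse-to-fine idea, the sketch below chains two tiny decoder-only LMs: Stage-1 decodes semantic tokens from continuous mixture and enrollment embeddings, Stage-2 decodes acoustic tokens conditioned on the Stage-1 output. All module sizes, vocabularies, the greedy decoder, and the reuse of Stage-1 embeddings as Stage-2 conditioning are illustrative assumptions, not GenTSE's actual architecture; in the real system a neural codec would turn the acoustic tokens back into a waveform, and training would add the Frozen-LM Conditioning and DPO steps described above.

```python
import torch
import torch.nn as nn

class CausalTokenLM(nn.Module):
    """Minimal decoder-only LM: a continuous prefix conditions
    autoregressive prediction of a discrete token stream."""
    def __init__(self, vocab, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, prefix, tokens):
        # prefix: (B, P, D) continuous conditioning; tokens: (B, T) ids
        x = torch.cat([prefix, self.embed(tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        return self.head(h[:, prefix.size(1):])   # logits at token positions

@torch.no_grad()
def greedy_decode(lm, prefix, bos, steps):
    tokens = torch.full((prefix.size(0), 1), bos, dtype=torch.long)
    for _ in range(steps):
        logits = lm(prefix, tokens)
        tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
    return tokens[:, 1:]

# Hypothetical sizes: 500 semantic ids, 1024 acoustic ids, 256-dim features.
sem_lm, ac_lm = CausalTokenLM(500), CausalTokenLM(1024)
mix_emb = torch.randn(1, 50, 256)      # continuous embeddings of the mixture
enroll_emb = torch.randn(1, 20, 256)   # continuous enrollment embeddings

# Stage 1: coarse semantic tokens from enrollment + mixture context.
sem = greedy_decode(sem_lm, torch.cat([enroll_emb, mix_emb], dim=1), bos=0, steps=25)
# Stage 2: fine acoustic tokens conditioned on the predicted semantics.
ac = greedy_decode(ac_lm, torch.cat([enroll_emb, sem_lm.embed(sem)], dim=1), bos=0, steps=25)
print(sem.shape, ac.shape)   # a codec decoder would turn `ac` into a waveform
```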
Related papers
- Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers [24.722647001947923]
We propose a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. We show that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results.
arXiv Detail & Related papers (2025-10-06T08:26:55Z)
- High-Fidelity Speech Enhancement via Discrete Audio Tokens [35.61634772862795]
DAC-SE1 is a language-model-based SE framework leveraging discrete high-resolution audio representations. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and a MUSHRA human evaluation.
arXiv Detail & Related papers (2025-10-02T16:38:05Z)
- FlowTSE: Target Speaker Extraction with Flow Matching [16.054014378418316]
FlowTSE is a simple yet effective TSE approach based on conditional flow matching. For tasks where phase reconstruction is crucial, we propose a novel vocoder conditioned on the complex STFT of the mixed signal.
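The summary names conditional flow matching without defining it; the standard training objective regresses a velocity field along straight noise-to-data paths. Below is a generic sketch of that loss, where the shapes, the toy network, and the frame-aligned mixture conditioning are all assumed for illustration rather than taken from FlowTSE.

```python
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Stand-in velocity field v_theta(x_t, t, cond); any conditional net works."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, xt, t, cond):
        # broadcast the scalar time over frames and concatenate conditioning
        t = t.view(-1, 1, 1).expand(-1, xt.size(1), 1)
        return self.net(torch.cat([xt, cond, t], dim=-1))

def cfm_loss(v_net, x1, cond):
    """Regress the constant velocity (x1 - x0) on x_t = (1 - t) x0 + t x1."""
    x0 = torch.randn_like(x1)              # prior (noise) sample
    t = torch.rand(x1.size(0))             # one time step per example
    xt = (1 - t.view(-1, 1, 1)) * x0 + t.view(-1, 1, 1) * x1
    return ((v_net(xt, t, cond) - (x1 - x0)) ** 2).mean()

x1 = torch.randn(4, 100, 80)    # target-speaker mel frames (illustrative)
cond = torch.randn(4, 100, 80)  # mixture features, frame-aligned (assumption)
print(cfm_loss(ToyVelocityNet(80), x1, cond).item())
```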
arXiv Detail & Related papers (2025-05-20T15:01:30Z)
- MTLM: Incorporating Bidirectional Text Information to Enhance Language Model Training in Speech Recognition Systems [8.971049629873185]
MTLM is a novel training paradigm that unifies unidirectional and bidirectional modeling through three training objectives. It supports multiple decoding strategies, including shallow fusion and unidirectional/bidirectional n-best rescoring. Experiments on the LibriSpeech dataset show that MTLM consistently outperforms unidirectional training across multiple decoding strategies.
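Shallow fusion and n-best rescoring are standard ways to inject an external LM at decoding time; a minimal sketch follows, with the interpolation weight, toy hypotheses, and scores invented for illustration. A bidirectional LM can score whole hypotheses, which is where MTLM's unified training would plug in.

```python
def shallow_fusion_score(asr_logp, lm_logp, lam=0.3):
    """Shallow fusion: interpolate acoustic and LM log-probabilities."""
    return asr_logp + lam * lm_logp

def rescore_nbest(nbest, lm_score, lam=0.3):
    """Re-rank an n-best list with an external LM.
    nbest: list of (hypothesis, asr_log_prob); lm_score: hyp -> log_prob."""
    return max(nbest, key=lambda h: shallow_fusion_score(h[1], lm_score(h[0]), lam))

# Toy example with made-up scores.
nbest = [("i scream", -4.2), ("ice cream", -4.5)]
toy_lm = {"i scream": -9.0, "ice cream": -3.0}
print(rescore_nbest(nbest, lambda h: toy_lm[h], lam=0.5))  # ('ice cream', -4.5)
```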
arXiv Detail & Related papers (2025-02-14T10:21:10Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result is a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
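The summary only says that augmentation happens in speaker-embedding space at negligible cost; one generic realization of that idea, sketched below under the assumption of class-conditional Gaussian perturbations (not necessarily DASA's actual recipe), is:

```python
import numpy as np

def augment_embeddings(emb, labels, strength=0.5, rng=None):
    """Perturb each speaker embedding with noise drawn from its own class's
    empirical covariance: a generic, cheap way to diversify samples in
    embedding space (illustrative only, not DASA's published method)."""
    rng = rng or np.random.default_rng(0)
    out = emb.copy()
    for spk in np.unique(labels):
        idx = labels == spk
        cov = np.cov(emb[idx].T) + 1e-6 * np.eye(emb.shape[1])  # keep PD
        out[idx] += strength * rng.multivariate_normal(
            np.zeros(emb.shape[1]), cov, size=int(idx.sum()))
    return out

emb = np.random.default_rng(1).normal(size=(8, 16))  # 8 embeddings, dim 16
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(augment_embeddings(emb, labels).shape)         # (8, 16)
```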
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models.
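Replaced token detection is the ELECTRA-style objective the summary names: a generator fills masked positions and a discriminator labels which tokens were swapped. The sketch below builds those corrupted inputs and labels; the generator and discriminator outputs are random stand-ins, and GanLM's companion replaced-token-denoising loss is omitted.

```python
import torch
import torch.nn as nn

def rtd_inputs_and_labels(tokens, generator_logits, mask):
    """Fill masked positions with generator samples; label a position 1 if the
    token no longer matches the original (replaced token detection)."""
    sampled = torch.distributions.Categorical(logits=generator_logits).sample()
    corrupted = torch.where(mask, sampled, tokens)
    return corrupted, (corrupted != tokens).float()

V, B, T = 100, 2, 8
tokens = torch.randint(0, V, (B, T))
mask = torch.rand(B, T) < 0.15                    # positions the generator fills
corrupted, labels = rtd_inputs_and_labels(tokens, torch.randn(B, T, V), mask)
disc_logits = torch.randn(B, T)                   # stand-in discriminator output
print(nn.BCEWithLogitsLoss()(disc_logits, labels).item())
```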
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
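One way to realize a supervision-guided codebook is to cluster frame-level hidden states from a supervised (e.g. ASR-trained) model and use the cluster ids as masked-prediction targets, replacing codebooks built from raw or purely self-supervised features. The sketch below illustrates that idea; the codebook size, feature shapes, and k-means settings are assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook_targets(features, n_codes=256):
    """Cluster supervised-model frame features into a codebook; cluster ids
    become per-frame pseudo-labels for masked prediction pre-training."""
    frames = features.reshape(-1, features.shape[-1])
    km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(frames)
    return km, km.labels_.reshape(features.shape[:-1])

feats = np.random.default_rng(0).normal(size=(4, 50, 32))  # (utts, frames, dim)
km, targets = build_codebook_targets(feats, n_codes=16)
print(targets.shape)  # (4, 50): one pseudo-label per frame
```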
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular paradigm for large-scale pretraining of language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves seq2seq models by integrating more effective self-supervised information into the encoders.
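As a rough reading of "extra supervision on the encoder", the sketch below adds a masked-reconstruction term on encoder states to the usual seq2seq cross-entropy; the specific auxiliary objective and its weighting are assumptions, not E2S2's published losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def e2s2_style_loss(dec_logits, dec_targets, enc_states, recon_head,
                    mask, masked_ids, alpha=0.5):
    """Seq2seq cross-entropy plus an encoder-side self-supervised term:
    recover the ids of masked input tokens from encoder states."""
    ce = F.cross_entropy(dec_logits.flatten(0, 1), dec_targets.flatten())
    ssl = F.cross_entropy(recon_head(enc_states[mask]), masked_ids)
    return ce + alpha * ssl

V, B, T, D = 100, 2, 10, 32
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, ::4] = True                              # deterministic toy masking
masked_ids = torch.randint(0, V, (int(mask.sum()),))
loss = e2s2_style_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                       torch.randn(B, T, D), nn.Linear(D, V), mask, masked_ids)
print(loss.item())
```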
arXiv Detail & Related papers (2022-05-30T08:25:36Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
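"Repeatedly masks and predicts unit choices" matches mask-predict style non-autoregressive decoding; a generic sketch follows, where the re-masking schedule, iteration count, and stand-in unit predictor are assumptions rather than TranSpeech's exact procedure.

```python
import torch

@torch.no_grad()
def mask_predict(model, length, mask_id, iters=4):
    """Mask-predict decoding: predict every position in parallel, keep the
    most confident tokens, re-mask and re-predict the rest on a shrinking
    schedule (Ghazvininejad et al., 2019)."""
    tokens = torch.full((1, length), mask_id)
    for it in range(iters):
        probs, pred = model(tokens).softmax(-1).max(-1)   # (1, L) each
        tokens = pred
        n_mask = length * (iters - 1 - it) // iters       # linear schedule
        if n_mask > 0:
            lowest = probs[0].topk(n_mask, largest=False).indices
            tokens[0, lowest] = mask_id                   # re-mask low confidence
    return tokens

toy_unit_predictor = lambda t: torch.randn(1, t.size(1), 32)  # stand-in model
print(mask_predict(toy_unit_predictor, length=12, mask_id=0))
```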
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
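One common way to cast overlapped diarization as single-label prediction is a power-set encoding, where each subset of simultaneously active speakers becomes its own class; the mapping below illustrates the idea and may differ from SEND's exact formulation.

```python
from itertools import combinations

def powerset_labels(max_speakers=3):
    """Map every subset of active speakers to one class id, turning
    overlapped multi-label diarization into single-label classification."""
    subsets = [()]  # empty set = silence
    for k in range(1, max_speakers + 1):
        subsets += list(combinations(range(max_speakers), k))
    return {s: i for i, s in enumerate(subsets)}

label_map = powerset_labels(3)
print(len(label_map))     # 8 classes for 3 speakers (including silence)
print(label_map[(0, 2)])  # frames where speakers 0 and 2 overlap -> one id
```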
arXiv Detail & Related papers (2022-03-18T06:40:39Z)