Related papers: UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement

URL: http://arxiv.org/abs/2510.20441v1
Date: Thu, 23 Oct 2025 11:22:24 GMT
Title: UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement
Authors: Haoyin Yan, Chengwei Liu, Shaofei Xue, Xiaotao Liang, Zheng Xue,
Abstract summary: We propose UniSE, a unified decoder-only LM-based framework to handle different speech enhancement tasks. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling. Experiments indicate the proposed UniSE can achieve competitive performance compared to discriminative and generative baselines.
Score: 3.855026553620411
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The development of neural audio codecs (NACs) has largely promoted applications of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) has not yet been verified. In this work, we propose UniSE, a unified decoder-only LM-based framework to handle different SE tasks including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which facilitates compatibility between the distinct learning patterns of multiple tasks. Experiments on several benchmarks indicate the proposed UniSE can achieve competitive performance compared to discriminative and generative baselines, showing the capacity of LMs in unifying SE tasks. The demo page is available here: https://github.com/hyyan2k/UniSE.
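
Illustrative sketch (not from the paper): the abstract describes conditioning on input speech features and generating discrete target tokens autoregressively. The toy Python below shows only that general decoding pattern. Everything in it is an assumption made for illustration: `toy_scores` is a hypothetical stand-in for a decoder-only LM, the condition is a list of integers rather than real speech features, and greedy selection is used for simplicity. It is not the UniSE architecture or its decoding strategy.

```python
def toy_scores(condition, generated, vocab_size):
    """Hypothetical stand-in for a decoder-only LM.

    Returns one score per token, computed from the whole prefix
    (condition followed by the tokens generated so far).
    """
    state = 0
    for i, value in enumerate(list(condition) + list(generated)):
        state = (state * 31 + value + i) % 1009
    return [(state * (token + 3) + token * token) % 101
            for token in range(vocab_size)]


def generate(condition, vocab_size=8, eos=0, max_len=20):
    """Greedy autoregressive decoding: each step sees all earlier tokens."""
    generated = []
    for _ in range(max_len):
        scores = toy_scores(condition, generated, vocab_size)
        # Pick the highest-scoring token (greedy choice, an assumption here).
        token = max(range(vocab_size), key=scores.__getitem__)
        if token == eos:  # stop when the end-of-sequence token is chosen
            break
        generated.append(token)
    return generated


if __name__ == "__main__":
    # Assumed toy condition standing in for input speech features.
    print(generate([5, 2, 7, 1]))
```

In a real system the generated tokens would be passed to a neural audio codec decoder to reconstruct a waveform; that step is omitted here.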

Related papers

SLM-SS: Speech Language Model for Generative Speech Separation [47.06391017558454]
We propose SLM-SS, a novel approach that applies speech language models to speech separation. We frame SS as discrete multi-codebook sequence generation, using coder models to map quantized speech mixtures to target tokens. Our approach shows significantly better preservation of speech intelligibility, leading to improved linguistic consistency in a variety of downstream tasks.
arXiv Detail & Related papers (2026-01-27T12:22:43Z)
Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing. We reformulate speech processing tasks into speech-to-unit generation tasks. We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
SLM: Bridge the thin gap between speech and text foundation models [45.319071954143325]
Speech and Language Model (SLM) is a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. We show that SLM is efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
arXiv Detail & Related papers (2023-09-30T02:27:45Z)
SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text. We first convert all the speech utterances to discrete tokens using an offline neural encoder. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM [19.36630667212398]
We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis. Our method surpasses existing spoken language models in speaker preservation and semantic coherence.
arXiv Detail & Related papers (2023-05-24T15:39:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.