Related papers: High-Fidelity Speech Enhancement via Discrete Audio Tokens

High-Fidelity Speech Enhancement via Discrete Audio Tokens

URL: http://arxiv.org/abs/2510.02187v1
Date: Thu, 02 Oct 2025 16:38:05 GMT
Title: High-Fidelity Speech Enhancement via Discrete Audio Tokens
Authors: Luca A. Lanzendörfer, Frédéric Berdoz, Antonis Asonitis, Roger Wattenhofer,
Abstract summary: DAC-SE1 is a language model-based SE framework leveraging discrete high-resolution audio representations.<n>Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation.
Score: 35.61634772862795
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.

Related papers

PACE: Pretrained Audio Continual Learning [27.605574463021693]
We present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs)<n>In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability.<n>Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-03T10:28:35Z)
Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge.<n>We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z)
Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models.<n>Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z)
Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations [26.938560887095658]
Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss.<n>We propose QTTS, a novel TTS framework built upon our new audio, QDAC.<n>Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baseline.
arXiv Detail & Related papers (2025-07-16T12:47:09Z)
Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM [53.17360668423001]
Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation.<n>This work proposes a speaker-aware progressive OSD model that leverages a progressive training strategy to enhance the correlation between subtasks.<n> Experimental results show that the proposed method achieves state-of-the-art performance, with an F1 score of 82.76% on the AMI test set.
arXiv Detail & Related papers (2025-05-29T07:47:48Z)
SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation [10.828717295018123]
We propose a unified embedding framework that eliminates the need for intermediate text representations.<n>Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods.
arXiv Detail & Related papers (2025-01-26T15:04:02Z)
Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.<n>It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.<n>It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.<n>To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.<n>Our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
DDTSE: Discriminative Diffusion Model for Target Speech Extraction [62.422291953387955]
We introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE) We apply the same forward process as diffusion models and utilize the reconstruction loss similar to discriminative methods. We devise a two-stage training strategy to emulate the inference process during model training.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.