Related papers: Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers

URL: http://arxiv.org/abs/2601.10770v1
Date: Thu, 15 Jan 2026 13:47:55 GMT
Title: Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers
Authors: Runyuan Cai, Yu Lin, Yiming Wang, Chunlin Fu, Xiaodong Zeng,
Abstract summary: General-Purpose Audio (GPA) is a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC.
Score: 8.890811356340953
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
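
The abstract describes a single autoregressive model that reads text and audio from one shared discrete token space and is steered to TTS, ASR, or VC by an instruction. A minimal sketch of how such input sequences could be laid out is given below. It is not the GPA implementation: the vocabulary sizes, the special-token names (`<tts>`, `<asr>`, `<vc>`, `<sep>`, `<eos>`), and the helper names are all hypothetical, and a fixed task tag stands in for whatever instruction format the paper actually uses.

```python
from typing import List, Optional

# All sizes and token names here are hypothetical, chosen only for illustration.
TEXT_VOCAB = 1000      # text token ids occupy [0, TEXT_VOCAB)
AUDIO_CODEBOOK = 512   # audio ids occupy [TEXT_VOCAB, TEXT_VOCAB + AUDIO_CODEBOOK)
SPECIAL = {
    name: TEXT_VOCAB + AUDIO_CODEBOOK + i
    for i, name in enumerate(["<tts>", "<asr>", "<vc>", "<sep>", "<eos>"])
}


def audio_to_shared(codes: List[int]) -> List[int]:
    """Shift audio codebook indices into the shared vocabulary."""
    for c in codes:
        if not 0 <= c < AUDIO_CODEBOOK:
            raise ValueError(f"audio code out of range: {c}")
    return [TEXT_VOCAB + c for c in codes]


def build_prompt(
    task: str,
    text: Optional[List[int]] = None,
    source_audio: Optional[List[int]] = None,
    reference_audio: Optional[List[int]] = None,
) -> List[int]:
    """Build the conditioning prefix; the model would continue it token by token."""
    sep = SPECIAL["<sep>"]
    if task == "tts":
        # Text in, audio tokens expected as the continuation.
        if text is None:
            raise ValueError("tts needs text token ids")
        return [SPECIAL["<tts>"]] + list(text) + [sep]
    if task == "asr":
        # Audio in, text tokens expected as the continuation.
        if source_audio is None:
            raise ValueError("asr needs source audio codes")
        return [SPECIAL["<asr>"]] + audio_to_shared(source_audio) + [sep]
    if task == "vc":
        # Reference voice, then source speech; converted audio is the continuation.
        if source_audio is None or reference_audio is None:
            raise ValueError("vc needs source and reference audio codes")
        return (
            [SPECIAL["<vc>"]]
            + audio_to_shared(reference_audio)
            + [sep]
            + audio_to_shared(source_audio)
            + [sep]
        )
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    print(build_prompt("asr", source_audio=[0, 5]))
    print(build_prompt("tts", text=[7, 8, 9]))
```

Because every task reduces to "prefix in, tokens out" over the same vocabulary, switching tasks in this sketch changes only the prefix, not the model, which is the property the abstract attributes to the unified design.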

Related papers

Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates semantic strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling [52.537908557508324]
HarmoniFuse is a component-selective and prompt-adaptive framework for multi-task speech language modeling. A batch-interleaved training strategy enables leveraging separate ASR and SER datasets without requiring joint annotation.
arXiv Detail & Related papers (2025-09-23T02:53:38Z)
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z)
SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions [48.02083833667388]
We present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the Large Language Model.
arXiv Detail & Related papers (2025-01-31T18:30:36Z)
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce Generative Pre-trained Speech Transformer (GPST). GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
WavLLM: Towards Robust and Adaptive Speech Large Language Model [93.0773293897888]
We introduce WavLLM, a robust and adaptive speech large language model with dual encoders, and a prompt-aware LoRA weight adapter. We validate the proposed model on universal speech benchmarks including tasks such as ASR, ST, SV, ER, and also apply it to specialized datasets like Gaokao English listening comprehension set for SQA, and speech Chain-of-Thought (CoT) evaluation set.
arXiv Detail & Related papers (2024-03-31T12:01:32Z)
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text. We first convert all the speech utterances to discrete tokens using an offline neural encoder. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
