Explore the Reinforcement Learning for the LLM based ASR and TTS system
- URL: http://arxiv.org/abs/2509.18569v1
- Date: Tue, 23 Sep 2025 02:52:54 GMT
- Title: Explore the Reinforcement Learning for the LLM based ASR and TTS system
- Authors: Changfeng Gao, Yabin Li, Keyu An, Zhifu Gao, Zhihao Du, Han Zhao, Xiangang Li,
- Abstract summary: Large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. In this study, we propose a lightweight reinforcement learning framework tailored for audio-based LLMs. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems.
- Score: 22.18395435959418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. Based on this framework, we evaluate the effectiveness of reinforcement learning on both ASR and TTS tasks. For the ASR task, we experiment with different rule-based reward functions within the Group Relative Policy Optimization (GRPO) framework and investigate the impact of RL data construction. For the TTS task, we compare GRPO with Differentiable Reward Optimization (DiffRO) and further combine the two approaches to achieve improved performance. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems, even with limited training data and a small number of optimization steps.
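The abstract mentions rule-based reward functions inside GRPO for the ASR task but does not specify them here. The following is a minimal sketch under stated assumptions: the reward `1 - WER` (clipped to the range 0 to 1) is a hypothetical choice for illustration, not necessarily one of the rewards the authors tested, and all function names are invented for this example. It shows only the group-relative step of GRPO, where each sampled hypothesis is scored against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def rule_based_reward(reference, hypothesis):
    """Hypothetical reward (an assumption): 1 - WER, with WER clipped at 1."""
    return 1.0 - min(wer(reference, hypothesis), 1.0)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One reference transcript and a group of sampled hypotheses (toy data).
reference = "the cat sat on the mat"
group = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "cat sat mat",
    "the dog ran",
]
rewards = [rule_based_reward(reference, h) for h in group]
advantages = group_relative_advantages(rewards)
```

Hypotheses with a lower WER than the group average receive positive advantages and are reinforced; those above the average receive negative ones. In a full training loop these advantages would weight the policy-gradient term for each hypothesis's tokens, which this sketch omits.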
Related papers
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models [34.15708407614003]
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities. We present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-10T16:03:44Z) - FunAudio-ASR Technical Report [89.84148151617022]
We present FunAudio-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning. FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements.
arXiv Detail & Related papers (2025-09-15T23:19:36Z) - Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards [50.21528417884747]
We introduce Omni-Thinker, a unified reinforcement learning framework that enhances large language model (LLM) performance across diverse tasks. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging.
arXiv Detail & Related papers (2025-07-20T01:50:16Z) - Differentiable Reward Optimization for LLM based TTS system [46.658935067247945]
This paper proposes a novel Differentiable Reward Optimization (DiffRO) method aimed at enhancing the performance of neural language models based text-to-speech (TTS) systems. In contrast to conventional reinforcement learning from human feedback (RLHF), DiffRO directly computes the rewards based on neural tokens, rather than relying on synthesized audio. We introduce a multi-task reward (MTR) model which can provide feedback from different perspectives and find that it can augment the system's capability to follow instructions effectively.
arXiv Detail & Related papers (2025-07-08T11:57:16Z) - Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting. This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
arXiv Detail & Related papers (2025-05-23T05:50:14Z) - Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - Rewarding What Matters: Step-by-Step Reinforcement Learning for Task-Oriented Dialogue [17.47550065558479]
Reinforcement learning (RL) is a powerful approach to enhance task-oriented dialogue (TOD) systems.
Existing RL methods tend to mainly focus on generation tasks, while neglecting dialogue state tracking (DST) for understanding.
We introduce step-by-step rewards throughout the token generation to extend RL into both understanding and generation tasks.
arXiv Detail & Related papers (2024-06-20T16:15:40Z) - An Embarrassingly Simple Approach for LLM with Strong ASR Capacity [56.30595787061546]
We focus on solving one of the most important tasks in the field of speech processing, with speech foundation encoders and large language models (LLMs).
Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM.
We found that delicate designs are not necessary, while an embarrassingly simple composition of off-the-shelf speech encoder, LLM, and the only trainable linear projector is competent for the ASR task.
arXiv Detail & Related papers (2024-02-13T23:25:04Z) - A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale [64.10124092250126]
Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus.
In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting.
We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.
arXiv Detail & Related papers (2023-04-19T18:09:27Z) - Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems [30.873546090458678]
Large-scale language models (LLMs) have been successfully applied to ASR N-best rescoring.
In this study, we incorporate LLM rescoring into one of the most competitive ASR baselines: the Conformer-Transducer model.
arXiv Detail & Related papers (2022-04-01T05:20:55Z) - Long-Running Speech Recognizer: An End-to-End Multi-Task Learning Framework for Online ASR and VAD [10.168591454648123]
This paper presents a novel end-to-end (E2E), multi-task learning (MTL) framework that integrates ASR and VAD into one model.
The proposed system, which we refer to as Long-Running Speech Recognizer (LR-SR), learns ASR and VAD jointly from two separate task-specific datasets in the training stage.
In the inference stage, the LR-SR system removes non-speech parts at low computational cost and recognizes speech parts with high robustness.
arXiv Detail & Related papers (2021-03-02T11:49:03Z)