Related papers: Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

URL: http://arxiv.org/abs/2502.17239v1
Date: Mon, 24 Feb 2025 15:16:34 GMT
Title: Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Authors: Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen,
Abstract summary: Baichuan-Audio is an end-to-end audio large language model that seamlessly integrates audio understanding and generation.<n>It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities.
Score: 9.101978573666546
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time speech-based conversation and exhibits outstanding question-answering capabilities, demonstrating its versatility and efficiency. The proposed model demonstrates superior performance in real-time spoken dialogue and exhibits strong question-answering abilities. Our code, model and training data are available at https://github.com/baichuan-inc/Baichuan-Audio

Related papers

Step-Audio 2 Technical Report [108.04129284951314]
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.<n>By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding.
arXiv Detail & Related papers (2025-07-22T14:23:55Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [70.25062476543091]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation.<n>MCTP module efficiently generates multiple audio tokens within a single model forward pass.<n>Four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
arXiv Detail & Related papers (2025-05-06T17:59:53Z)
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation [17.56310064245171]
SALMON-omni is a speech understanding and generation model capable of simultaneously listening to its own generated speech sounds while speaking. SALMON-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full- conversational AI systems.
arXiv Detail & Related papers (2024-11-27T08:38:57Z)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [11.62674351793]
We introduce a novel audio-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer. Our proposed method outperforms baselines across various context TTS scenarios.
arXiv Detail & Related papers (2024-06-06T03:06:45Z)
SALMONN: Towards Generic Hearing Abilities for Large Language Models [24.73033723114979]
We propose SALMONN, a speech audio language music open neural network. It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA)
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM [19.36630667212398]
We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis. Our method surpasses existing spoken language models in speaker preservation and semantic coherence.
arXiv Detail & Related papers (2023-05-24T15:39:43Z)
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model) The proposed VATLM employs a unified backbone network to model the modality-independent information. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure. We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.