Fun-Audio-Chat Technical Report
- URL: http://arxiv.org/abs/2512.20156v3
- Date: Tue, 30 Dec 2025 04:08:22 GMT
- Title: Fun-Audio-Chat Technical Report
- Authors: Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou,
- Abstract summary: temporal resolution between speech tokens (25Hz) and text tokens (3Hz) mitigates semantic information mismatchs and incurs high computational costs.<n>We introduce Fun-Audio-Chat, a Large-stage Spoken Speech-to-scale tasks.<n>Fun-Audio-Chat 8B and MoE 30BA3B achieve competitive performance on SpeechText and Speech-to-scale tasks.
- Score: 71.07966678560291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model addressing these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs requiring large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive to superior performance on Audio Understanding, Speech Function Calling, Instruction-Following and Voice Empathy. We develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo, at https://github.com/FunAudioLLM/Fun-Audio-Chat .
Related papers
- Step-Audio 2 Technical Report [120.58375054866815]
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.<n>By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding.
arXiv Detail & Related papers (2025-07-22T14:23:55Z) - VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [84.25283710008785]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation.<n>MCTP module efficiently generates multiple audio tokens within a single model forward pass.<n>Four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
arXiv Detail & Related papers (2025-05-06T17:59:53Z) - Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction [9.101978573666546]
Baichuan-Audio is an end-to-end audio large language model that seamlessly integrates audio understanding and generation.<n>It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities.
arXiv Detail & Related papers (2025-02-24T15:16:34Z) - Qwen2-Audio Technical Report [73.94975476533989]
We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio.
Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.
We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis.
arXiv Detail & Related papers (2024-07-15T14:38:09Z) - Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities [37.02115473120654]
Augmenting large language models (LLMs) to understand audio is critically important for diverse real-world applications.
In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities.
arXiv Detail & Related papers (2024-02-02T18:58:34Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.