Related papers: FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

URL: http://arxiv.org/abs/2407.04051v3
Date: Thu, 11 Jul 2024 02:08:35 GMT
Title: FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Siqi Zheng
Abstract summary: FunAudioLLM is a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub.
Score: 63.8261207950923
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.

Related papers

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing [37.022292043526186]
VoiceCraft-X is an autoregressive neural language model which unifies multilingual speech editing and Text-to-Speech synthesis. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data.
arXiv Detail & Related papers (2025-11-15T20:27:25Z)
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training [70.31925012315064]
We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild. Key features of CosyVoice 3 include a novel speech tokenizer to improve prosody naturalness. Data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects.
arXiv Detail & Related papers (2025-05-23T07:55:21Z)
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play [21.93291433513335]
Voila achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models. Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds.
arXiv Detail & Related papers (2025-05-05T15:05:01Z)
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2. Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens. We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [152.41217651729738]
GPT-4o is an omni-modal model that enables vocal conversations with diverse emotions and tones. We propose EMOVA to enable Large Language Models with end-to-end speech capabilities. For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks.
arXiv Detail & Related papers (2024-09-26T16:44:02Z)
OpenVoice: Versatile Instant Voice Cloning [22.217256641284106]
We introduce OpenVoice, a versatile voice cloning approach. It requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice has been used by more than 2M users worldwide as the voice engine of MyShell.ai.
arXiv Detail & Related papers (2023-12-03T18:41:54Z)
SALMONN: Towards Generic Hearing Abilities for Large Language Models [24.73033723114979]
We propose SALMONN, a speech audio language music open neural network. It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model. It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST). We use discretized speech units, which are generated in a fully unsupervised way. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z)
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks. We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes. Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
