Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- URL: http://arxiv.org/abs/2505.02707v1
- Date: Mon, 05 May 2025 15:05:01 GMT
- Title: Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
- Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu,
- Abstract summary: Voila achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models with powerful acoustic modeling. Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds.
- Score: 21.93291433513335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
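The abstract's hierarchical multi-scale Transformer pairs a large frame-rate backbone (carrying the LLM's reasoning over fused text/semantic context) with a smaller local model that fills in fine-grained acoustic detail for each frame. The sketch below is a minimal, hypothetical illustration of that general pattern in PyTorch; the module names, sizes, codebook layout, and greedy decoding loop are assumptions for illustration and are not taken from the released Voila code.

```python
# Illustrative sketch (not the released Voila implementation): a backbone
# Transformer runs once per audio frame, and a small local Transformer expands
# each backbone state into a stack of acoustic codec tokens.
import torch
import torch.nn as nn


class HierarchicalSpeechDecoder(nn.Module):
    def __init__(self, d_model=512, n_codebooks=4, codebook_size=1024):
        super().__init__()
        self.n_codebooks = n_codebooks
        # Coarse, frame-rate backbone (stands in for the LLM component).
        backbone_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(backbone_layer, num_layers=4)
        # Local model: refines one frame into n_codebooks acoustic tokens.
        local_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.local = nn.TransformerEncoder(local_layer, num_layers=2)
        self.codebook_embed = nn.Embedding(n_codebooks * codebook_size, d_model)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, frame_states: torch.Tensor) -> torch.Tensor:
        """frame_states: (batch, frames, d_model) of fused text/audio context."""
        coarse = self.backbone(frame_states)            # (B, T, D)
        b, t, d = coarse.shape
        # Emit one token per codebook for every frame, autoregressively
        # within the frame.
        tokens = []
        step_input = coarse.reshape(b * t, 1, d)        # seed each frame
        for k in range(self.n_codebooks):
            hidden = self.local(step_input)             # (B*T, k+1, D)
            logits = self.head(hidden[:, -1])           # predict codebook k
            tok = logits.argmax(-1)                     # greedy, for the sketch
            tokens.append(tok.view(b, t))
            emb = self.codebook_embed(tok + k * self.head.out_features)
            step_input = torch.cat([step_input, emb.unsqueeze(1)], dim=1)
        return torch.stack(tokens, dim=-1)              # (B, T, n_codebooks)


if __name__ == "__main__":
    model = HierarchicalSpeechDecoder()
    ctx = torch.randn(1, 8, 512)                        # 8 frames of context
    print(model(ctx).shape)                             # torch.Size([1, 8, 4])
```

Running the backbone only once per frame while a much smaller model handles the per-frame acoustic tokens is what keeps this style of architecture compatible with the low end-to-end latencies the paper reports.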
Related papers
- MultiVox: Benchmarking Voice Assistants for Multimodal Interactions [43.55740197419447]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction [123.89581506075461]
We propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction. Our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms.
arXiv Detail & Related papers (2025-05-26T17:55:06Z)
- CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training [70.31925012315064]
We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild. Key features of CosyVoice 3 include a novel speech tokenizer to improve prosody naturalness. Data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects.
arXiv Detail & Related papers (2025-05-23T07:55:21Z)
- MinMo: A Multimodal Large Language Model for Seamless Voice Interaction [73.39573341265027]
We introduce MinMo, a Multimodal Large Language Model for seamless voice interaction. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation.
arXiv Detail & Related papers (2025-01-10T15:55:27Z)
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming [0.0]
Mini-Omni is an audio-based end-to-end conversational model capable of real-time speech interaction.
We propose a text-instructed speech generation method, along with batch-parallel strategies during inference to boost the performance.
We also introduce the VoiceAssistant-400K dataset to fine-tune models for optimized speech output.
arXiv Detail & Related papers (2024-08-29T17:18:53Z)
- FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs [63.8261207950923]
FunAudioLLM is a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity.
The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub.
arXiv Detail & Related papers (2024-07-04T16:49:02Z)