Qwen3-Omni Technical Report
- URL: http://arxiv.org/abs/2509.17765v1
- Date: Mon, 22 Sep 2025 13:26:24 GMT
- Title: Qwen3-Omni Technical Report
- Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
- Abstract summary: We present Qwen3-Omni, a single multimodal model that maintains state-of-the-art performance across text, image, audio, and video. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages.
- Score: 105.11829337290249
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, the Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
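The streaming claim above is concrete enough to sketch. The following minimal Python illustration is not the released Qwen3-Omni code; every name and constant here (number of codebooks, codebook size, samples per frame) is a hypothetical assumption. It shows only the structural point the abstract makes: with a causal ConvNet vocoder in place of block-wise diffusion, each multi-codebook frame the Talker emits can be rendered to waveform immediately, so audio exists after the first codec frame instead of after a diffusion block fills.

```python
# Minimal streaming sketch (hypothetical names/shapes; not the Qwen3-Omni API).
import torch
import torch.nn as nn

NUM_CODEBOOKS = 4     # assumption: width of the multi-codebook scheme
CODEBOOK_SIZE = 1024  # assumption
FRAME_SAMPLES = 480   # assumption: waveform samples rendered per codec frame

class CausalConvVocoder(nn.Module):
    """Stand-in for the 'lightweight causal ConvNet': left-padded convolution
    never sees future frames, so each frame can be decoded the moment it
    arrives instead of waiting for a block of frames to accumulate."""
    def __init__(self, dim=256, kernel=3):
        super().__init__()
        self.embed = nn.ModuleList(
            nn.Embedding(CODEBOOK_SIZE, dim) for _ in range(NUM_CODEBOOKS))
        self.conv = nn.Conv1d(dim, dim, kernel)
        self.head = nn.Linear(dim, FRAME_SAMPLES)
        self.kernel = kernel

    def forward(self, codes, cache):
        # codes: (NUM_CODEBOOKS,) token ids for ONE new codec frame
        x = sum(emb(c) for emb, c in zip(self.embed, codes))   # (dim,)
        cache.append(x)                                        # causal state
        ctx = torch.stack(cache[-self.kernel:], dim=1)         # (dim, <=k)
        if ctx.shape[-1] < self.kernel:                        # zero left-pad
            pad = torch.zeros(ctx.shape[0], self.kernel - ctx.shape[-1])
            ctx = torch.cat([pad, ctx], dim=-1)
        h = self.conv(ctx.unsqueeze(0))[0, :, -1]              # newest step only
        return self.head(h)                                    # (FRAME_SAMPLES,)

vocoder, cache = CausalConvVocoder(), []

def talker(n):  # stand-in for the autoregressive Talker (random ids here)
    for _ in range(n):
        yield torch.randint(0, CODEBOOK_SIZE, (NUM_CODEBOOKS,))

for i, frame in enumerate(talker(5)):
    chunk = vocoder(frame, cache)      # audio is available at the FIRST frame
    print(f"frame {i}: {chunk.numel()} samples ready to stream")
```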
Related papers
- Qwen3-TTS Technical Report [64.94647392030824]
We present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control. Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers.
arXiv Detail & Related papers (2026-01-22T03:51:43Z)
- Qwen3-VL Technical Report [153.3964813640593]
Qwen3-VL is the most capable vision-language model to date, achieving superior performance across a broad range of multimodal benchmarks. It supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks.
arXiv Detail & Related papers (2025-11-26T17:59:08Z)
- TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling [13.05578634768109]
We introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec). TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder. It achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech (a worked bitrate check appears after this list).
arXiv Detail & Related papers (2025-08-22T20:45:03Z)
- Qwen3 Technical Report [137.96804244102205]
We present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities.
arXiv Detail & Related papers (2025-05-14T13:41:34Z)
- VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [70.25062476543091]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation. Its MCTP module efficiently generates multiple audio tokens within a single model forward pass (a sketch of this pattern appears after this list). A four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
arXiv Detail & Related papers (2025-05-06T17:59:53Z)
- Qwen2.5-Omni Technical Report [31.033323728153984]
We present an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.
arXiv Detail & Related papers (2025-03-26T04:17:55Z)
- Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities. Our pipeline consists of three main components. First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios.
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State-of-the-art (SOTA) neural text-to-speech (TTS) models can generate natural-sounding synthetic voices.
This work proposes EfficientSpeech, an efficient neural TTS model that synthesizes speech in real time on an ARM CPU.
arXiv Detail & Related papers (2023-05-23T10:28:41Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
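Two of the snippets above carry numbers worth unpacking. First, the TaDiCodec entry: its frame rate and bitrate together pin down the codebook width, a handy sanity check when comparing speech tokenizers. The codebook size below is inferred from the quoted figures, not stated in the snippet.

```python
# Sanity-check of the TaDiCodec figures quoted above
# (6.25 Hz, 0.0875 kbps, single-layer codebook, 24 kHz speech).
frame_rate_hz = 6.25
bitrate_bps = 0.0875 * 1000                    # 0.0875 kbps = 87.5 bps

bits_per_frame = bitrate_bps / frame_rate_hz   # 87.5 / 6.25 = 14.0
codebook_size = 2 ** int(bits_per_frame)       # 2**14 = 16384 (inferred)

raw_pcm_bps = 24_000 * 16                      # 16-bit PCM at 24 kHz reference
print(bits_per_frame, codebook_size, round(raw_pcm_bps / bitrate_bps))
# -> 14.0 16384 4389   (~4389x smaller than raw 16-bit PCM)
```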
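Second, the VITA-Audio entry describes generating several audio tokens per backbone forward pass. A generic multi-token-prediction sketch of that idea follows; all names and shapes are illustrative assumptions, not the paper's MCTP implementation. One expensive backbone call yields a hidden state, and k cheap heads decode the next k audio tokens from it, cutting the number of backbone calls by roughly k.

```python
# Generic multi-token prediction in the spirit of VITA-Audio's MCTP module
# (hypothetical shapes/names; not the released implementation).
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    def __init__(self, dim=512, vocab=4096, k=4):
        super().__init__()
        # k lightweight linear heads: head j predicts audio token t+j+1.
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(k))

    def forward(self, hidden):                   # hidden: (batch, dim)
        # Greedy-decode k audio tokens from ONE backbone state.
        return torch.stack([h(hidden).argmax(-1) for h in self.heads], dim=-1)

state = torch.randn(1, 512)        # stand-in for one (expensive) LLM forward pass
tokens = MultiTokenHeads()(state)  # four audio tokens from a single pass
print(tokens.shape)                # torch.Size([1, 4])
```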