Fugu-MT Paper Translation (Abstract): Qwen3.5-Omni Technical Report

Paper abstract: Qwen3.5-Omni Technical Report

arxiv url: http://arxiv.org/abs/2604.15804v2
Date: Tue, 21 Apr 2026 03:35:14 GMT
Status: Translation complete
Last updated in system: 2026-04-22 14:04:47.698914
Title: Qwen3.5-Omni Technical Report
Authors: Qwen Team
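Abstract summary: Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech.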
Reference score (independently calculated attention level): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.


Related papers