Fugu-MT Paper Translation (Abstract): Valley3: Scaling Omni Foundation Models for E-commerce

Paper overview: Valley3: Scaling Omni Foundation Models for E-commerce

arxiv url: http://arxiv.org/abs/2605.01278v1
Date: Sat, 02 May 2026 06:25:48 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.679618
Title: Valley3: Scaling Omni Foundation Models for E-commerce
Title (reference translation): Valley3: Scaling Omni Foundation Models for E-commerce
Authors: Zeyu Chen, Guanghao Zhou, Qixiang Yin, Ziwang Zhao, Huanjin Yao, Pengjiu Xia, Min Yang, Cen Chen, Minghui Qiu
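Abstract summary: We present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models. We equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information.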
Reference score (independently computed attention level): 26.764304741635495
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning modes, enabling one non-thinking mode and three distinct levels of thinking, thereby balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.
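Abstract (reference translation): In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to support crucial audio-visual tasks, particularly in short-video scenarios. To this end, Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Next, long-chain reasoning with controllable reasoning modes is encouraged, enabling one non-thinking mode and three distinct levels of thinking, balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.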

