Fugu-MT Paper Translation (Abstract): InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Paper summary: InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

arxiv url: http://arxiv.org/abs/2508.18265v1
Date: Mon, 25 Aug 2025 17:58:17 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.905432
Title: InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Title (reference translation): InternVL3.5: Improving Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Authors: Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo
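Abstract summary: InternVL 3.5 is a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which enhances reasoning through a two-stage process. Our largest model, InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks.
Reference score (independently computed attention metric): 245.85790868739238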
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
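Abstract (reference translation): InternVL 3.5 is a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. The Cascade Reinforcement Learning (Cascade RL) framework enhances reasoning through a two-stage process. This coarse-to-fine training strategy substantially improves downstream reasoning tasks (e.g., MMMU, MathVista). We propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, the Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. Together, these contributions give InternVL3.5 up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup over InternVL3. In addition, InternVL3.5 supports new capabilities such as GUI interaction and embodied agency. Notably, our largest model, InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks. All models and code are publicly released.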

Related papers