Fugu-MT Paper Translation (Abstract): AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

Paper abstract: AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

arxiv url: http://arxiv.org/abs/2510.11496v2
Date: Tue, 14 Oct 2025 05:05:14 GMT
Status: Translation complete
System update date: 2025-10-15 12:06:24.265238
Title: AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
Title (reference translation): AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
Authors: Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu
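Abstract summary: AndesVL is a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. To facilitate efficient task adaptation and model compression, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning framework. When deploying AndesVL-4B on MediaTek Dimensity 9500 chips, we achieve a peak decoding speedup of up to 6.7x, up to 30.9% memory reduction, and 1.8 bits-per-weight.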
Reference score (independently computed attention score): 40.488271586857884
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Moreover, utilizing our cache eviction algorithm -- OKV -- along with customized speculative decoding and compression strategies, we achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We release all models on https://huggingface.co/OPPOer.
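Abstract (reference translation): In recent years, cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, but their requirements far exceed the memory, power consumption, and computing capacity limits of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance among state-of-the-art models of a similar scale across a wide range of open-source benchmarks, covering text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Moreover, using our cache eviction algorithm, OKV, together with customized speculative decoding and compression strategies, we achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. All models are released at https://huggingface.co/OPPOer.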
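The 1.8 bits-per-weight figure quoted in the abstract can be put in context with a rough size estimate. The sketch below is only back-of-the-envelope arithmetic, not the authors' measurement: it assumes "4B" means exactly 4e9 parameters, that every weight is stored at the quoted average rate, and that the uncompressed baseline is 16 bits per weight. It ignores activations, the KV cache, LoRA adapters, and any tensors kept at higher precision, and the function name is hypothetical.

```python
def weight_storage_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store num_params weights at an average bit width."""
    return num_params * bits_per_weight / 8


# Assumption: "4B" is taken as exactly 4e9 parameters.
NUM_PARAMS = 4_000_000_000

# Assumption: 16 bits per weight as the uncompressed baseline for comparison.
for label, bpw in [("16-bit baseline (assumed)", 16.0),
                   ("1.8 bits-per-weight", 1.8)]:
    gigabytes = weight_storage_bytes(NUM_PARAMS, bpw) / 1e9
    print(f"{label}: {gigabytes:.2f} GB")
```

Under these assumptions the weights alone would take about 8.00 GB at 16 bits and about 0.90 GB at 1.8 bits-per-weight. This is separate from the 30.9% memory reduction quoted in the abstract, which is a measured figure and is not derived here.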

