Fugu-MT paper translation (abstract): ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

Paper overview: ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

arxiv url: http://arxiv.org/abs/2510.12793v1
Date: Tue, 14 Oct 2025 17:58:10 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.441846
Title: ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
Authors: Long Cui, Weiyun Wang, Jie Shao, Zichen Wen, Gen Luo, Linfeng Zhang, Yanting Zhang, Yu Qiao, Wenhai Wang
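Abstract summary: Existing Multimodal Large Language Models (MLLMs) incur increased inference costs because of the additional vision tokens introduced by image inputs. This work proposes Visual Consistency Learning (ViCO), a novel training algorithm that lets the model represent images of varying complexity using different numbers of vision tokens. Experiments show that the method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities.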
Reference score (independently computed attention score): 71.69364653858447
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model's perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.
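The two mechanisms named in the abstract, a KL consistency term between responses conditioned on different connectors and a per-patch choice of compression rate, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the direction of the KL term, the use of exactly two connectors, and the token counts of 256 and 64 per patch are all assumptions made for illustration; the abstract specifies none of them.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(ref_logits, compressed_logits):
    """Mean per-position KL between two sets of response logits.

    ref_logits: logits when the image passes through the low-compression
    connector; compressed_logits: logits under the high-compression one.
    Using the low-compression response as the reference is an assumption.
    """
    kls = [
        kl_divergence(softmax(r), softmax(c))
        for r, c in zip(ref_logits, compressed_logits)
    ]
    return sum(kls) / len(kls)


def total_vision_tokens(route_high_compression, tokens_low=256, tokens_high=64):
    """Count vision tokens when a router assigns each patch to one connector.

    route_high_compression holds one boolean per image patch. The default
    token counts are hypothetical placeholders, not values from the paper.
    """
    return sum(tokens_high if flag else tokens_low for flag in route_high_compression)
```

Under these assumed numbers, routing two of four patches to the high-compression connector, `total_vision_tokens([False, True, True, False])`, gives 640 tokens instead of the 1024 that four uncompressed patches would need. The saving depends entirely on how many patches the router judges to be semantically simple.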

List of related papers