Fugu-MT Paper Translation (Summary): FastVLM: Efficient Vision Encoding for Vision Language Models

Paper summary: FastVLM: Efficient Vision Encoding for Vision Language Models

arxiv url: http://arxiv.org/abs/2412.13303v1
Date: Tue, 17 Dec 2024 20:09:55 GMT
Status: Translation complete
Last updated in system: 2024-12-19 16:46:51.979566
Title: FastVLM: Efficient Vision Encoding for Vision Language Models
Title (reference translation): FastVLM: Efficient Vision Encoding for Vision Language Models
Authors: Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
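Abstract summary: We introduce FastVLM, a model that achieves an optimized trade-off between latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
Reference score (independently computed attention metric): 22.41836943083826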
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves 3.2$\times$ improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVa-OneVision at the highest resolution (1152$\times$1152), FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU, using the same 0.5B LLM, but with 85$\times$ faster TTFT and a vision encoder that is 3.4$\times$ smaller.
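Abstract (reference translation): Scaling the input image resolution is essential for improving the performance of Vision Language Models (VLMs). However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and the high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2$\times$ improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior work. Compared to LLaVa-OneVision at the highest resolution (1152$\times$1152), FastVLM achieves comparable performance on key benchmarks like SeedBench and MMMU using the same 0.5B LLM, but with 85$\times$ faster TTFT and a vision encoder that is 3.4$\times$ smaller.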
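
The abstract's point that ViT-style encoders produce many tokens at high resolution can be illustrated with a rough sketch. This is not FastVLM or FastViTHD code: the function names and the patch sizes (14 and 16) are assumptions chosen for illustration, and the pairwise-interaction count is only a proxy for self-attention cost, not a latency measurement.

```python
# Illustrative only: how the token count of a plain ViT grows with resolution.
# Patch sizes are assumed values, not taken from the FastVLM paper.

def vit_token_count(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens a plain ViT emits for an image (no pooling)."""
    return (height // patch_size) * (width // patch_size)


def attention_pairs(num_tokens: int) -> int:
    """Token-pair interactions per self-attention layer (quadratic in tokens)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    for side, patch in [(336, 14), (672, 14), (1152, 16)]:
        tokens = vit_token_count(side, side, patch)
        print(f"{side}x{side}, patch {patch}: "
              f"{tokens} tokens, {attention_pairs(tokens)} attention pairs")
```

Doubling the side length quadruples the token count and multiplies the pairwise attention interactions by sixteen, which is the scaling pressure the abstract describes for high-resolution inputs.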

Related papers