Fugu-MT Paper Translation (Summary): ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba

Paper summary: ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba

arxiv url: http://arxiv.org/abs/2503.09509v1
Date: Wed, 12 Mar 2025 16:18:45 GMT
Status: Translation complete
Last updated in system: 2025-03-13 21:17:52.841664
Title: ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba
Authors: Juncan Deng, Shuaiting Li, Zeyu Wang, Kedong Xu, Hong Gu, Kejie Huang,
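Abstract summary: Visual Mamba networks (ViMs) extend the selective state space model (Mamba) to various vision tasks. Vector quantization (VQ) decomposes network weights into codebooks and assignments. In this paper, we propose ViM-VQ, a vector quantization method tailored for ViMs.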
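As a minimal sketch of the codebook-and-assignment decomposition described above (not ViM-VQ itself), the NumPy code below splits a weight matrix into length-`d` sub-vectors, fits a `k`-entry codebook with plain k-means, and stores one codebook index per sub-vector. The function names and the use of k-means are illustrative assumptions.

```python
import numpy as np


def vq_decompose(weight, d=4, k=16, iters=25, seed=0):
    """Split `weight` into length-d sub-vectors and fit a k-entry codebook with k-means.

    Returns (codebook [k, d], assignments [n]); codebook[assignments] approximates
    the sub-vectors. Requires weight.size to be divisible by d.
    """
    subvecs = weight.reshape(-1, d)
    rng = np.random.default_rng(seed)
    # Initialise codewords from k distinct sub-vectors.
    codebook = subvecs[rng.choice(len(subvecs), size=k, replace=False)].copy()

    def assign(cb):
        # Squared distance from every sub-vector to every codeword: shape [n, k].
        dists = ((subvecs[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(iters):
        assignments = assign(codebook)
        for j in range(k):
            members = subvecs[assignments == j]
            if len(members) > 0:  # leave empty clusters where they are
                codebook[j] = members.mean(axis=0)
    return codebook, assign(codebook)


def vq_reconstruct(codebook, assignments, shape):
    """Replace every index with its codeword and restore the original weight shape."""
    return codebook[assignments].reshape(shape)


def index_bits_per_weight(k, d):
    """Index storage only: log2(k) bits shared by d weights (codebook overhead ignored)."""
    return np.log2(k) / d


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.standard_normal((64, 64)).astype(np.float32)
    codebook, idx = vq_decompose(w, d=4, k=16)
    w_hat = vq_reconstruct(codebook, idx, w.shape)
    print("index bits per weight:", index_bits_per_weight(16, 4))
    print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

With `k=16` and `d=4`, each 4-bit index covers four weights, i.e. about 1 bit per weight before counting the small codebook; lower bit-widths come from a larger `d` or a smaller `k`.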
Reference score (attention score calculated by this site): 7.369445527610879
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Visual Mamba networks (ViMs) extend the selective state space model (Mamba) to various vision tasks and demonstrate significant potential. Vector quantization (VQ), on the other hand, decomposes network weights into codebooks and assignments, significantly reducing memory usage and computational latency to enable ViM deployment on edge devices. Although existing VQ methods have achieved extremely low-bit quantization (e.g., 3-bit, 2-bit, and 1-bit) in convolutional neural networks and Transformer-based networks, directly applying these methods to ViMs results in unsatisfactory accuracy. We identify several key challenges: 1) The weights of Mamba-based blocks in ViMs contain numerous outliers, significantly amplifying quantization errors. 2) When applied to ViMs, the latest VQ methods suffer from excessive memory consumption, lengthy calibration procedures, and suboptimal performance in the search for optimal codewords. In this paper, we propose ViM-VQ, an efficient post-training vector quantization method tailored for ViMs. ViM-VQ consists of two innovative components: 1) a fast convex combination optimization algorithm that efficiently updates both the convex combinations and the convex hulls to search for optimal codewords, and 2) an incremental vector quantization strategy that incrementally confirms optimal codewords to mitigate truncation errors. Experimental results demonstrate that ViM-VQ achieves state-of-the-art performance in low-bit quantization across various visual tasks.
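The abstract only names ViM-VQ's two components, so the toy NumPy sketch below illustrates the general idea under stated assumptions and is not the authors' algorithm. Each sub-vector starts as a softmax-weighted convex combination of its `n_cand` nearest codewords (a simplified stand-in for the paper's convex hulls), the mixing logits are tuned by gradient descent on the calibration output error ‖X(Ŵ − W)‖², and each round a growing share of the most confident sub-vectors is confirmed to a single codeword and frozen. Codebook (convex-hull) updates are omitted, and every function and parameter name here is hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def incremental_convex_vq(W, X, codebook, d, n_cand=4, rounds=4, steps=50, lr=0.1):
    """Toy relaxed VQ with incremental hard assignment (hypothetical, illustrative).

    W: weights [d_in, d_out]; X: calibration inputs [m, d_in]; codebook: [k, d].
    Loss: ||X @ (W_hat - W)||_F^2 / m. Returns hard codeword assignments [n].
    """
    subvecs = W.reshape(-1, d)
    n = len(subvecs)
    dists = ((subvecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    cand = np.argsort(dists, axis=1)[:, :n_cand]      # candidate codeword ids [n, n_cand]
    cand_vecs = codebook[cand]                         # candidate codewords [n, n_cand, d]
    logits = np.zeros((n, n_cand))
    logits[:, 0] = 1.0                                 # mild prior toward the nearest codeword
    hard = cand[:, 0].copy()
    frozen = np.zeros(n, dtype=bool)
    H = X.T @ X / len(X)                               # input second moment [d_in, d_in]

    for r in range(1, rounds + 1):
        for _ in range(steps):
            alpha = softmax(logits)                    # convex weights, each row sums to 1
            soft = (alpha[..., None] * cand_vecs).sum(axis=1)
            soft[frozen] = codebook[hard[frozen]]      # confirmed sub-vectors stay fixed
            grad_w = 2.0 * H @ (soft.reshape(W.shape) - W)  # dLoss/dW_hat
            g = grad_w.reshape(-1, d)
            g_alpha = (cand_vecs * g[:, None, :]).sum(axis=-1)
            # Chain rule through softmax: dL/dz = alpha * (dL/dalpha - <alpha, dL/dalpha>).
            g_logits = alpha * (g_alpha - (alpha * g_alpha).sum(axis=-1, keepdims=True))
            g_logits[frozen] = 0.0
            logits -= lr * g_logits
        # Confirm the most confident unfrozen sub-vectors so that r/rounds of all are hard.
        alpha = softmax(logits)
        confidence = alpha.max(axis=1)
        confidence[frozen] = -np.inf
        n_new = int(np.ceil(n * r / rounds)) - int(frozen.sum())
        if n_new > 0:
            pick = np.argsort(-confidence)[:n_new]
            hard[pick] = cand[pick, alpha[pick].argmax(axis=1)]
            frozen[pick] = True
    return hard


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((32, 32))
    X = rng.standard_normal((256, 32))
    sub = W.reshape(-1, 4)
    codebook = sub[rng.choice(len(sub), size=16, replace=False)]
    hard = incremental_convex_vq(W, X, codebook, d=4)
    nearest = ((sub[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)

    def output_error(assign):
        return np.linalg.norm(X @ (codebook[assign].reshape(W.shape) - W))

    print("nearest-codeword output error:", output_error(nearest))
    print("incremental convex-VQ output error:", output_error(hard))
```

On random data this toy is not guaranteed to beat plain nearest-codeword rounding. It only shows the mechanics: an output-aware objective may select a codeword other than the nearest one, and freezing in rounds lets the still-soft sub-vectors adapt to the rounding already committed.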

Related papers

VADMamba: Exploring State Space Models for Fast Video Anomaly Detection [4.874215132369157]
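The VQ-Mamba Unet (VQ-MaU) framework incorporates a Vector Quantization (VQ) layer and a Mamba-based Non-negative Visual State Space (NVSS) block. The effectiveness of the proposed VADMamba is validated on three benchmark datasets.
Paper reference translation (metadata) (2025-03-27T05:38:12Z)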
OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models [15.757637971482477]
We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). It has two parts: 1) OuroMamba-Gen, which generates semantically rich and meaningful synthetic data; and 2) OuroMamba-Quant, which uses mixed-precision quantization with lightweight dynamic outlier detection during inference.
Paper reference translation (metadata) (2025-03-13T23:58:55Z)
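The summary mentions mixed-precision quantization with lightweight dynamic outlier detection. The NumPy sketch below is a generic illustration of that pattern, not OuroMamba's detector: the median-based threshold `tau`, the per-input channel statistics, and the 4-bit/8-bit split are all assumptions.

```python
import numpy as np


def fake_quant_sym(x, bits):
    """Symmetric uniform quantize-dequantize with one scale for the whole array."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def mixed_precision_quant(acts, low_bits=4, high_bits=8, tau=4.0):
    """Detect outlier channels for this input and keep them at higher precision.

    acts: [tokens, channels]. A channel is flagged when its max |value| exceeds
    tau times the median channel max (hypothetical rule). Returns (dequantized, mask).
    """
    ch_max = np.abs(acts).max(axis=0)
    outliers = ch_max > tau * np.median(ch_max)
    out = np.empty_like(acts)
    if (~outliers).any():
        out[:, ~outliers] = fake_quant_sym(acts[:, ~outliers], low_bits)
    if outliers.any():
        out[:, outliers] = fake_quant_sym(acts[:, outliers], high_bits)
    return out, outliers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.standard_normal((128, 64))
    acts[:, 3] *= 50.0  # inject one large-magnitude channel
    deq, mask = mixed_precision_quant(acts)
    uniform4 = fake_quant_sym(acts, 4)
    print("outlier channels:", np.flatnonzero(mask))
    print("mixed-precision mean abs error:", np.abs(deq - acts).mean())
    print("uniform 4-bit mean abs error:", np.abs(uniform4 - acts).mean())
```

Separating the outlier channel keeps its large range from inflating the shared 4-bit step size, which would otherwise round most ordinary activations to zero.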
AIQViT: Architecture-Informed Post-Training Quantization for Vision Transformers [42.535119270045605]
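Post-training quantization (PTQ) has emerged as a promising solution for reducing the storage and computational costs of vision transformers (ViTs). We propose a PTQ method tailored for ViTs, called AIQViT (Architecture-Informed Post-training Quantization for ViTs).
Paper reference translation (metadata) (2025-02-07T03:04:50Z)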
PTQ4VM: Post-Training Quantization for Visual Mamba [9.446971590056945]
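This paper introduces two key strategies: Per-Token Static Quantization (PTS) and Joint Learning of Smoothing Scale and Step Size (JLSS). PTQ4VM can be applied to various Visual Mamba backbones and converts a pretrained model into a quantized format in under 15 minutes.
Paper reference translation (metadata) (2024-12-29T07:21:33Z)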
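Per-token static quantization is only named in the summary. A minimal NumPy sketch of the general idea, assuming it means one scale per token position that is fixed ahead of time from calibration data rather than recomputed at inference, might look like this; the function names and the max-based calibration rule are hypothetical, and JLSS is not modeled.

```python
import numpy as np


def calibrate_per_token_scales(calib_acts, bits=8):
    """One static scale per token position from calibration activations.

    calib_acts: [batch, tokens, channels]; assumes a fixed token count, as with
    a fixed input resolution. Returns scales of shape [tokens].
    """
    qmax = 2 ** (bits - 1) - 1
    token_max = np.abs(calib_acts).max(axis=(0, 2))
    return np.maximum(token_max, 1e-8) / qmax


def quantize_per_token_static(acts, scales, bits=8):
    """Quantize-dequantize [batch, tokens, channels] with precomputed per-token scales."""
    qmax = 2 ** (bits - 1) - 1
    s = scales[None, :, None]
    return np.clip(np.round(acts / s), -qmax, qmax) * s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    token_gain = np.ones(16)
    token_gain[0] = 30.0  # one token position with a much larger range
    calib = rng.standard_normal((8, 16, 32)) * token_gain[None, :, None]
    test = rng.standard_normal((2, 16, 32)) * token_gain[None, :, None]
    scales = calibrate_per_token_scales(calib, bits=4)
    per_token = quantize_per_token_static(test, scales, bits=4)
    # Per-tensor static baseline: a single scale shared by all token positions.
    per_tensor = quantize_per_token_static(test, np.full(16, scales.max()), bits=4)
    print("per-token error:", np.abs(per_token - test).mean())
    print("per-tensor error:", np.abs(per_tensor - test).mean())
```

Because the scales are computed once from calibration data, inference needs no extra reduction over activations, while a token position with an unusually large range no longer dictates the step size for every other position.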
V2M: Visual 2-Dimensional Mamba for Image Representation Learning [68.51380287151927]
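Mamba has attracted widespread attention for its flexible design and efficient hardware performance when processing 1D sequences. Recent work has tried to apply Mamba to the visual domain by flattening 2D images into patches and treating them as 1D sequences. We propose a Visual 2-Dimensional Mamba model as a complete solution that directly processes image tokens in 2D space.
Paper reference translation (metadata) (2024-10-14T11:11:06Z)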
HRVMamba: High-Resolution Visual State Space Model for Dense Prediction [60.80423207808076]
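State space models (SSMs) with efficient hardware-aware designs have demonstrated great potential in computer vision tasks. However, these models are constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representations. This paper proposes a Dynamic Visual State Space (DVSS) block that uses deformable convolution to mitigate the long-range forgetting problem. We also introduce a High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process.
Paper reference translation (metadata) (2024-10-04T06:19:29Z)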
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevalent backbone networks in the computer vision community. This paper proposes the Adaptive Logarithm (AdaLog) quantizer.
Paper reference translation (metadata) (2024-07-17T18:38:48Z)
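The AdaLog summary names an adaptive logarithm quantizer. The sketch below shows only the basic idea assumed here: quantize positive values such as post-softmax attention scores to powers of a base, and choose that base by grid search on calibration data instead of fixing it at 2. The function names, the candidate grid, the 4-bit level count, and the absence of a dedicated zero level are simplifying assumptions, not AdaLog's actual design.

```python
import numpy as np


def log_quantize(x, base, bits=4):
    """Quantize values in (0, 1] to base**(-q) with integer q in [0, 2**bits - 1]."""
    q = np.round(-np.log(np.clip(x, 1e-12, 1.0)) / np.log(base))
    q = np.clip(q, 0, 2 ** bits - 1)  # values below base**-(2**bits - 1) saturate there
    return base ** (-q)


def search_log_base(calib, bits=4, candidates=None):
    """Pick the base with the lowest MSE on calibration data (simple grid search)."""
    if candidates is None:
        candidates = np.arange(12, 41) / 10.0  # 1.2, 1.3, ..., 4.0 (includes 2.0)
    errors = [np.mean((log_quantize(calib, b, bits) - calib) ** 2) for b in candidates]
    return float(candidates[int(np.argmin(errors))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.standard_normal((64, 197)) * 3.0
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax rows
    best = search_log_base(attn, bits=4)
    for b in (2.0, best):
        print(f"base {b:.1f}: MSE {np.mean((log_quantize(attn, b) - attn) ** 2):.3e}")
```

Since the grid contains 2.0, the searched base can never do worse than a fixed log2 quantizer on the calibration set itself.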
ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers [7.155242379236052]
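Quantization of vision transformers (ViTs) has emerged as a promising solution to mitigate these challenges. However, existing methods still suffer from accuracy degradation at low bit-widths. ADFQ-ViT significantly improves over various baselines on image classification, object detection, and instance segmentation tasks at 4-bit.
Paper reference translation (metadata) (2024-07-03T02:41:59Z)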
LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
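The computational cost of the self-attention mechanism limits its practicality for long sequences. We propose a new method called LongVQ, which compresses global abstractions into a fixed-length codebook. LongVQ effectively maintains dynamic global and local patterns, which helps compensate for the weak modeling of long-range dependencies.
Paper reference translation (metadata) (2024-04-17T08:26:34Z)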
I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization [49.17407185195788]
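We introduce I&S-ViT, a new method that regulates the PTQ of ViTs in an inclusive and stable manner. I&S-ViT improves the performance of 3-bit ViT-B by 50.68%.
Paper reference translation (metadata) (2023-11-16T13:07:47Z)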
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
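M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE). Although MoE achieves higher accuracy and reduces computation by over 80%, challenges remain in deploying it efficiently on FPGAs. Our work, called Edge-MoE, addresses this challenge by introducing the first end-to-end FPGA accelerator for multi-task ViTs, together with a collection of architectural innovations.
Paper reference translation (metadata) (2023-05-30T02:24:03Z)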
Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
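Vision transformers (ViTs) have achieved impressive performance on various computer vision tasks. We propose Single-Path Vision Transformer pruning (SPViT), which efficiently and automatically compresses pretrained ViTs. Our SPViT can prune 52.0% of the FLOPs of DeiT-B.
Paper reference translation (metadata) (2021-11-23T11:35:54Z)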

The related papers list is generated automatically from the titles and abstracts of papers on this site.

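The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).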