Fugu-MT Paper Translation (Abstract): MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

Paper abstract: MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

arxiv url: http://arxiv.org/abs/2604.13432v1
Date: Wed, 15 Apr 2026 03:06:24 GMT
Status: Translation complete
System update date: 2026-04-16 20:38:32.365397
Title: MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
Title (reference translation): MaMe & MaRe: Improving the Efficiency of Visual Perception and Synthesis through Matrix-Based Token Merging and Restoration
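Authors: Simin Huo, Ning Li
Abstract summary: Token compression is essential for mitigating the quadratic complexity of the self-attention mechanism in Vision Transformers (ViTs). This paper introduces MaMe, a training-free, differentiable token merging method based on matrix operations. It also presents MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis.
Reference score (independently calculated attention level): 2.5885108031811006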
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe's and MaRe's effectiveness in accelerating vision models. The code is available at https://github.com/cominder/mame.
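Abstract (reference translation): Token compression is important for reducing the quadratic complexity of the self-attention mechanism in Vision Transformers (ViTs), which involve many input tokens. Existing methods such as ToMe rely on GPU-inefficient operations (e.g., sorting, scattered writes), which introduce overhead that limits their effectiveness. We introduce MaMe, a training-free, differentiable token merging method built on GPU-friendly matrix operations to accelerate ViTs. In addition, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. Applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% drop in accuracy. Notably, fine-tuning the last layer with MaMe improves ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with an accuracy loss of only 0.84%. Furthermore, MaMe improves both performance and speed simultaneously on some tasks. In image synthesis, the MaMe+MaRe pipeline improves quality while reducing Stable Diffusion v2.1 generation latency by 31%. Taken together, these results demonstrate the effectiveness of MaMe and MaRe in accelerating vision models. The code is available at https://github.com/cominder/mame.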

Related papers