Fugu-MT Paper Translation (Summary): Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Paper summary: Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

arxiv url: http://arxiv.org/abs/2506.05709v1
Date: Fri, 06 Jun 2025 03:18:11 GMT
Status: Translation complete
Last updated in system: 2025-06-09 17:28:43.305949
Title: Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
Title (reference translation): Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
Authors: Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang,
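Abstract summary: We propose a many-to-many Token Transforming framework that generalizes all existing methods. Specifically, it reduces FLOPs by 40% and accelerates DeiT-S by $\times$1.5 with a marginal 0.1% accuracy drop. The method is extended to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation.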
Reference score (independently computed attention measure): 8.584066042703972
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision transformers have been widely explored in various vision tasks. Because of their heavy computational cost, much interest has arisen in dynamically compressing vision transformers at the token level. Current methods mainly focus on token pruning or merging to reduce the number of tokens; because tokens are compressed exclusively, these methods cause great information loss, and post-training is therefore inevitably required to recover performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods construct special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods and preserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce FLOPs by 40% and accelerate DeiT-S by $\times$1.5 with a marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction, and inference acceleration.
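Abstract (reference translation): Vision transformers have been widely studied across a range of vision tasks. Because of their heavy computational cost, there is considerable interest in dynamically compressing vision transformers at the token level. Current methods focus on token pruning or merging to reduce the number of tokens; compressing tokens exclusively causes great information loss, so post-training is inevitably required to recover performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation. Furthermore, we propose a many-to-many Token Transforming framework that generalizes all existing methods and enables training-free acceleration. We conduct extensive experiments to validate the framework. Specifically, we reduce FLOPs by 40% and accelerate DeiT-S by $\times$1.5. We also extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. The results show that the proposed method consistently achieves substantial improvements in the computation-performance trade-off, budget reduction, and inference acceleration.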
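
The abstract's central idea, that token reduction can be written as a matrix applied to the token sequence, can be illustrated with a small sketch. The code below is an assumption-laden toy, not the paper's implementation: the function names (`transform`, `normalize_rows`), the row normalization, and the hand-picked weights are all hypothetical, and the abstract does not say how the actual transformation matrix is constructed. It only shows how pruning and merging appear as special cases of a matrix W with fewer rows than columns, while a many-to-many matrix lets one input token contribute to several outputs.

```python
# Illustrative sketch only: token reduction written as Y = W X, where X holds
# N token vectors and W is an M x N matrix with M < N.
# All names and toy weights are assumptions, not taken from the paper's code.

def transform(W, X):
    """Apply an M x N token transformation matrix W to N tokens X (N x D)."""
    dim = len(X[0])
    return [[sum(w * x[d] for w, x in zip(row, X)) for d in range(dim)]
            for row in W]

def normalize_rows(W):
    """Scale each row to sum to 1 so every output token is a weighted average."""
    return [[w / sum(row) for w in row] for row in W]

# Four toy tokens with two features each.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

# Pruning: each row selects exactly one token; tokens 1 and 3 are discarded.
W_prune = [[1, 0, 0, 0],
           [0, 0, 1, 0]]

# Merging: each input token contributes to exactly one output (many-to-one).
W_merge = normalize_rows([[1, 1, 0, 0],
                          [0, 0, 1, 1]])

# Many-to-many: tokens 1 and 2 contribute to both outputs.
W_many = normalize_rows([[2, 1, 1, 0],
                         [0, 1, 1, 2]])

pruned = transform(W_prune, X)
merged = transform(W_merge, X)
mixed = transform(W_many, X)

print("pruned:", pruned)
print("merged:", merged)
print("many-to-many:", mixed)
```

In this toy setting, the pruning matrix drops two tokens outright, the merging matrix averages disjoint pairs, and the many-to-many matrix keeps a share of every input token in the output, which is one way to read the abstract's claim about preserving more information.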

Related papers

ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference [12.986605266786839]
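We introduce a new Token Freezing and Reusing framework that identifies important tokens at each stage and temporarily freezes the unimportant ones. ToFe reduces the computational cost of the LV-ViT model by 50% with a top-1 accuracy drop of less than 2%.
Paper reference translation (metadata) (2025-07-22T06:17:44Z)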
Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
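Token compression is essential for reducing the computational and memory requirements of transformer models. This paper proposes an efficient, hardware-compatible token compression method called Prune and Merge.
Paper reference translation (metadata) (2025-03-30T14:23:18Z)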
CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
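This paper proposes an efficient vision transformer called CageViT. Unlike current Transformers, CageViT uses a new encoder to process the rearranged tokens. Experimental results show that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
Paper reference translation (metadata) (2023-05-17T03:19:18Z)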
Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning [28.180891300826165]
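Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers. This work provides two non-parametric operators: a token clustering layer that decreases the number of tokens and a token reconstruction layer that increases the number of tokens. Promising results are shown on five dense prediction tasks: object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
Paper reference translation (metadata) (2022-10-03T15:49:48Z)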
Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
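We propose an adaptive sparse token pruning framework with minimal cost. The proposed method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy.
Paper reference translation (metadata) (2022-09-28T03:07:32Z)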
Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
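We present a new approach to model acceleration that exploits spatial sparsity in visual data. We propose a dynamic token sparsification framework that prunes redundant tokens. The method is extended to hierarchical models, including CNNs and hierarchical vision transformers.
Paper reference translation (metadata) (2022-07-04T17:00:51Z)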
Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer [63.99222215387881]
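This paper proposes Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. The method can significantly reduce the computational cost of vision transformers while maintaining comparable performance on image classification.
Paper reference translation (metadata) (2021-08-03T09:56:07Z)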
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
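We propose a dynamic token sparsification framework that prunes redundant tokens based on the input. By hierarchically pruning 66% of the input tokens, the method greatly reduces FLOPs by 31% to 37% and improves throughput by over 40%. DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared with state-of-the-art CNNs and vision transformers on ImageNet.
Paper reference translation (metadata) (2021-06-03T17:57:41Z)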
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
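This paper proposes Funnel-Transformer, which compresses the sequence of hidden states into a shorter one. With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a variety of sequence-level prediction tasks.
Paper reference translation (metadata) (2020-06-05T05:16:23Z)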

The related papers list is generated automatically from the titles and abstracts of papers on this site.

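This page shows information for the specified paper.
The operator of this site does not guarantee the quality of the site (including all information and translations) and accepts no responsibility for any outcome arising from its use.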