Fugu-MT Paper Translation (Summary): Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Paper summary: Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

arxiv url: http://arxiv.org/abs/2201.01615v4
Date: Wed, 9 Aug 2023 14:15:32 GMT
Status: Translation complete
System update date: 2023-08-10 10:57:01.531245
Title: Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
Title (reference translation): Lawin Transformer: Improving the Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
Authors: Haotian Yan and Chuang Zhang and Ming Wu
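Abstract summary: Multi-scale representations are essential for semantic segmentation. This paper introduces multi-scale representations into semantic segmentation ViT using a window attention mechanism. The resulting ViT, Lawin Transformer, consists of an HVT as the encoder and LawinASPP as the decoder.
Reference score (site-computed attention level): 16.75003034164463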
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-scale representations are crucial for semantic segmentation. The community has witnessed the flourishing of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Motivated by the strength of the vision transformer (ViT) in image classification, several semantic segmentation ViTs have recently been proposed, most of them attaining impressive results but at the cost of computational economy. In this paper, we succeed in introducing multi-scale representations into semantic segmentation ViT via a window attention mechanism and further improve the performance and efficiency. To this end, we introduce large window attention, which allows the local window to query a larger area of context window with only a small computational overhead. By regulating the ratio of the context area to the query area, we enable the $\textit{large window attention}$ to capture contextual information at multiple scales. Moreover, the framework of spatial pyramid pooling is adopted to collaborate with $\textit{the large window attention}$, which yields a novel decoder named $\textbf{la}$rge $\textbf{win}$dow attention spatial pyramid pooling (LawinASPP) for semantic segmentation ViT. Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as encoder and a LawinASPP as decoder. The empirical results demonstrate that Lawin Transformer offers improved efficiency compared to existing methods. Lawin Transformer further sets new state-of-the-art performance on the Cityscapes (84.4% mIoU), ADE20K (56.2% mIoU) and COCO-Stuff datasets. The code will be released at https://github.com/yan-hao-tian/lawin
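Abstract (reference translation): Multi-scale representations are essential for semantic segmentation. The community has witnessed the rise of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Because the vision transformer (ViT) is powerful in image classification, semantic segmentation ViTs have also been proposed in recent years. In this paper, we succeed in introducing multi-scale representations into semantic segmentation ViT through a window attention mechanism, further improving performance and efficiency. To this end, we introduce large window attention, which lets a local window query a wider context window with only a small computational overhead. By adjusting the ratio of the context area to the query area, $\textit{large window attention}$ can capture contextual information at multiple scales. Furthermore, the spatial pyramid pooling framework is adopted to work together with $\textit{the large window attention}$, yielding a novel decoder for semantic segmentation ViT named $\textbf{la}$rge $\textbf{win}$dow attention spatial pyramid pooling (LawinASPP). The resulting ViT, Lawin Transformer, consists of an efficient hierarchical vision transformer (HVT) as the encoder and LawinASPP as the decoder. Experimental results show that Lawin Transformer is more efficient than existing methods. Lawin Transformer also sets new state-of-the-art performance on the Cityscapes (84.4% mIoU), ADE20K (56.2% mIoU), and COCO-Stuff datasets. The code will be released at https://github.com/yan-hao-tian/lawin.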
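The idea of a local window querying a larger context window can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `large_window_attention`, the parameters `P` (query window size) and `R` (ratio of context side to query side), the zero padding at the borders, and the average pooling of the context back to `P x P` tokens are all choices made here to keep the key/value count constant as `R` grows. The abstract itself only says the overhead is small. No learned projections are used; queries, keys, and values are the raw features.

```python
import numpy as np

def large_window_attention(feat, top, left, P, R):
    """One P x P query window attends to an (R*P) x (R*P) context window.

    feat: array of shape (H, W, C); (top, left) is the query window's corner.
    Assumption: the context is average-pooled by R, giving P*P key/value tokens.
    """
    C = feat.shape[2]
    query = feat[top:top + P, left:left + P].reshape(P * P, C)

    # Zero-pad so a context window centred on the query stays inside the map.
    extra = (R - 1) * P
    before, after = extra // 2, extra - extra // 2
    padded = np.pad(feat, ((before, after), (before, after), (0, 0)))

    # In padded coordinates the context window starts at (top, left).
    context = padded[top:top + R * P, left:left + R * P]

    # Average-pool each R x R block so the context shrinks to P x P tokens.
    pooled = context.reshape(P, R, P, R, C).mean(axis=(1, 3))
    keys = pooled.reshape(P * P, C)

    # Scaled dot-product attention with a numerically stable softmax.
    scores = query @ keys.T / np.sqrt(C)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ keys).reshape(P, P, C)

# Illustrative multi-scale use: the same query window at several ratios.
feat = np.random.default_rng(0).normal(size=(16, 16, 8))
branches = [large_window_attention(feat, 4, 4, P=4, R=r) for r in (1, 2, 3)]
multi_scale = np.concatenate(branches, axis=-1)  # shape (4, 4, 24)
```

Each branch sees a different amount of context while producing an output of the same spatial size, which is what lets the branches be concatenated in a spatial-pyramid style. The ratios `(1, 2, 3)` are illustrative only.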

Related papers

SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers [76.13755422671822]
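This paper explores the capability of plain vision transformers (ViTs) for semantic segmentation using an encoder-decoder framework. We introduce the Attention-to-Mask (ATM) module to design a lightweight decoder that is effective for plain ViTs. Our decoder outperforms the popular decoder UPerNet across various ViT backbones while consuming only about 5% of the computational cost.
Paper reference translation (metadata) (2023-06-09T22:29:56Z)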
Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
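We propose a novel Semantic Token ViT (STViT) for global and local vision transformers. Compared with the original networks, the proposed method reduces FLOPs by more than 30% in object detection and instance segmentation. In addition, we design an STViT-R(ecover) network based on STViT to restore detailed spatial information, which is effective for downstream tasks.
Paper reference translation (metadata) (2023-03-15T15:12:36Z)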
RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
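We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation. It achieves a better trade-off between performance and efficiency than CNN-based models. Experiments on mainstream benchmarks demonstrate the effectiveness of the proposed RTFormer.
Paper reference translation (metadata) (2022-10-13T16:03:53Z)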
SSformer: A Lightweight Transformer for Semantic Segmentation [7.787950060560868]
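Swin Transformer set new records on various vision tasks using a hierarchical architecture and shifted windows. We design a lightweight and effective transformer model called SSformer. Experimental results show that the proposed SSformer achieves mIoU performance comparable to state-of-the-art models.
Paper reference translation (metadata) (2022-08-03T12:57:00Z)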
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions [109.33112814212129]
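We show that input-adaptive, long-range, and high-order spatial interactions can be implemented efficiently with a convolution-based framework. We propose the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv), which performs high-order spatial interactions with gated convolutions. Based on this operation, we construct a new generic vision backbone named HorNet.
Paper reference translation (metadata) (2022-07-28T17:59:02Z)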
Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
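We explore the global context learning potential of vision transformers (ViTs) for dense visual prediction. Our motivation is that, by learning global context at the full receptive field layer by layer, ViTs can capture stronger long-range dependency information. We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
Paper reference translation (metadata) (2022-07-19T15:49:35Z)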
Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
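Iwin Transformer is a hierarchical transformer that performs token representation learning and token aggregation within irregular windows. The effectiveness and efficiency of Iwin Transformer are validated on two standard HOI detection benchmark datasets.
Paper reference translation (metadata) (2022-03-20T12:04:50Z)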
Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
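Transformers have shown great potential in computer vision tasks. Recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows. This design greatly improves efficiency but lacks global feature reasoning in the early stages. In this work, we design a multi-path structure for the Transformer that enables local-to-global reasoning at multiple granularities in each stage.
Paper reference translation (metadata) (2021-07-10T02:34:55Z)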
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
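We present CSWin Transformer, an efficient Transformer-based backbone for general-purpose vision tasks. A challenge in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interaction between tokens.
Paper reference translation (metadata) (2021-07-01T17:59:56Z)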
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer [20.92010433074935]
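We propose a new vision transformer named Shuffle Transformer. The proposed architecture achieves excellent performance on a wide range of visual tasks, including image-level classification, object detection, and semantic segmentation.
Paper reference translation (metadata) (2021-06-07T14:22:07Z)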

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

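This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).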