DPT: Deformable Patch-based Transformer for Visual Recognition
- URL: http://arxiv.org/abs/2107.14467v1
- Date: Fri, 30 Jul 2021 07:33:17 GMT
- Title: DPT: Deformable Patch-based Transformer for Visual Recognition
- Authors: Zhiyang Chen, Yousong Zhu, Chaoyang Zhao, Guosheng Hu, Wei Zeng,
Jinqiao Wang, Ming Tang
- Abstract summary: We propose a new Deformable Patch (DePatch) module which learns to adaptively split the images into patches with different positions and scales in a data-driven way.
The DePatch module can work as a plug-and-play module, which can easily be incorporated into different transformers to achieve an end-to-end training.
- Score: 57.548916081146814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have achieved great success in computer vision, but
how to split an image into patches remains an open problem. Existing methods
usually use fixed-size patch embedding, which can destroy the semantics of objects. To
address this problem, we propose a new Deformable Patch (DePatch) module which
learns to adaptively split the images into patches with different positions and
scales in a data-driven way rather than using predefined fixed patches. In this
way, our method can well preserve the semantics in patches. The DePatch module
can work as a plug-and-play module, which can easily be incorporated into
different transformers to achieve an end-to-end training. We term this
DePatch-embedded transformer as Deformable Patch-based Transformer (DPT) and
conduct extensive evaluations of DPT on image classification and object
detection. Results show DPT can achieve 81.9% top-1 accuracy on ImageNet
classification, and 43.7% box mAP with RetinaNet, 44.3% with Mask R-CNN on
MSCOCO object detection. Code has been made available at:
https://github.com/CASIA-IVA-Lab/DPT .
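The adaptive splitting described above can be pictured as each patch predicting an offset and a scale, then sampling its pixels from the deformed region by bilinear interpolation. Below is a minimal plain-Python sketch of that idea; the function names, the fixed k x k sampling grid, and the parameterization are illustrative assumptions, not the authors' implementation (see the linked repository for the real code).

```python
# Sketch of a deformable patch (DePatch-style) sampler, assuming each patch
# has a predicted center offset (dy, dx) and half-extents (sh, sw).
# Hypothetical names; requires k >= 2.

def bilinear(img, y, x):
    """Bilinearly sample img (a list of rows) at fractional coords (y, x)."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (img[y0][x0] * (1 - dy) * (1 - dx)
            + img[y0][x1] * (1 - dy) * dx
            + img[y1][x0] * dy * (1 - dx)
            + img[y1][x1] * dy * dx)

def deformable_patch(img, cy, cx, dy, dx, sh, sw, k=2):
    """Sample a k x k grid from the rectangle centered at (cy+dy, cx+dx)
    with half-extents (sh, sw), i.e. a patch with learned position/scale."""
    pts = []
    for i in range(k):
        for j in range(k):
            # evenly spaced sampling points inside the deformed rectangle
            py = cy + dy - sh + (2 * sh) * i / (k - 1)
            px = cx + dx - sw + (2 * sw) * j / (k - 1)
            pts.append(bilinear(img, py, px))
    return pts
```

With zero offset and half-extents matching the grid, the sampler reduces to ordinary fixed-patch extraction; in a trained module the offsets and scales would instead come from a small prediction head on the feature map.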
Related papers
- SKU-Patch: Towards Efficient Instance Segmentation for Unseen Objects in
Auto-Store [102.45729472142526]
In large-scale storehouses, precise instance masks are crucial for robotic bin picking.
This paper presents a new patch-guided instance segmentation solution, leveraging only a few image patches for each incoming new SKU.
SKU-Patch yields an average of nearly 100% grasping success rate on more than 50 unseen SKUs in a robot-aided auto-store logistic pipeline.
arXiv Detail & Related papers (2023-11-08T12:44:38Z)
- DBAT: Dynamic Backward Attention Transformer for Material Segmentation with Cross-Resolution Patches [8.812837829361923]
We propose the Dynamic Backward Attention Transformer (DBAT) to aggregate cross-resolution features.
Experiments show that our DBAT achieves an accuracy of 86.85%, which is the best performance among state-of-the-art real-time models.
We further align features with semantic labels via network dissection, showing that the proposed model extracts material-related features better than other methods.
arXiv Detail & Related papers (2023-05-06T03:47:20Z)
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
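The randomization idea can be sketched in a few lines: at each training step, draw a patch size from a set of candidates and derive the resulting token count. This is an illustrative assumption about the training loop, not FlexiViT's actual code (which also resizes the patch-embedding weights to match the chosen size).

```python
import random

# Hypothetical sketch: pick a random patch size per training step;
# the sequence length (token count) changes accordingly.
def patchify(img_hw, patch_sizes, rng):
    h, w = img_hw
    p = rng.choice(patch_sizes)           # random patch size this step
    assert h % p == 0 and w % p == 0      # assume divisible sizes
    return p, (h // p) * (w // p)         # chosen size, number of tokens

rng = random.Random(0)
p, n = patchify((240, 240), [8, 12, 16, 20, 24, 30, 40, 48], rng)
```

Smaller draws of `p` yield longer token sequences (higher accuracy, higher cost), so one set of weights sees the whole speed/accuracy range during training.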
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
- DeViT: Deformed Vision Transformers in Video Inpainting [59.73019717323264]
First, we extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH).
Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wise feature matching.
Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to obtain accurate attention to spatial-temporal tokens.
arXiv Detail & Related papers (2022-09-28T08:57:14Z) - Patcher: Patch Transformers with Mixture of Experts for Precise Medical
Image Segmentation [17.51577168487812]
We present a new encoder-decoder Vision Transformer architecture, Patcher, for medical image segmentation.
Unlike standard Vision Transformers, it employs Patcher blocks that segment an image into large patches.
Transformers are applied to the small patches within a large patch, which constrains the receptive field of each pixel.
arXiv Detail & Related papers (2022-06-03T04:02:39Z)
- Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation [29.08732248577141]
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks.
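One simple patch-based transformation that destroys global semantics while preserving patch content is patch shuffling. The sketch below illustrates the kind of transformation involved; the paper's exact augmentations and how they enter the loss are assumptions here, not reproduced from it.

```python
import random

# Illustrative patch-based transformation: split an image into p x p patches
# and shuffle them, keeping every pixel but scrambling global structure.
def shuffle_patches(img, p, rng):
    h, w = len(img), len(img[0])
    patches = []
    for y in range(0, h, p):
        for x in range(0, w, p):
            patches.append([row[x:x + p] for row in img[y:y + p]])
    rng.shuffle(patches)
    # reassemble the shuffled patches into an image of the same size
    out = [[0] * w for _ in range(h)]
    idx = 0
    for y in range(0, h, p):
        for x in range(0, w, p):
            for i in range(p):
                for j in range(p):
                    out[y + i][x + j] = patches[idx][i][j]
            idx += 1
    return out
```

A ViT that still classifies such an image confidently is relying on patch-local cues, which is the insensitivity the paper measures and then penalizes via negative augmentation.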
arXiv Detail & Related papers (2021-10-15T04:53:18Z)
- Certified Patch Robustness via Smoothed Vision Transformers [77.30663719482924]
We show how using vision transformers enables significantly better certified patch robustness.
These improvements stem from the inherent ability of the vision transformer to gracefully handle largely masked images.
arXiv Detail & Related papers (2021-10-11T17:44:05Z)
- Exploring and Improving Mobile Level Vision Transformers [81.7741384218121]
We study the vision transformer structure at the mobile level in this paper and observe a dramatic performance drop.
We propose a novel irregular patch embedding module and adaptive patch fusion module to improve the performance.
arXiv Detail & Related papers (2021-08-30T06:42:49Z)
- SimPatch: A Nearest Neighbor Similarity Match between Image Patches [0.0]
We use large patches instead of relatively small ones so that each patch contains more information.
Different feature extraction mechanisms extract the features of each individual image patch, forming a feature matrix.
For a query patch in a given image, the nearest patches are computed using two different nearest-neighbor algorithms.
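The core matching step reduces to a nearest-neighbor search over patch feature vectors. The sketch below uses brute-force Euclidean search as a stand-in; the paper's actual feature extractors and its two search algorithms are not specified here, so all names are illustrative.

```python
import math

# Illustrative nearest-neighbor match between patch feature vectors.
# `features` is one row of the feature matrix per patch.
def nearest_patch(query, features):
    """Return the index of the feature vector closest to query (Euclidean)."""
    best, best_d = -1, math.inf
    for i, f in enumerate(features):
        d = math.dist(query, f)
        if d < best_d:
            best, best_d = i, d
    return best
```

For large patch databases one would swap the linear scan for a k-d tree or approximate index, but the brute-force form shows the matching criterion plainly.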
arXiv Detail & Related papers (2020-08-07T10:51:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.