A Contrastive Learning Scheme with Transformer Innate Patches
- URL: http://arxiv.org/abs/2303.14806v2
- Date: Mon, 8 Jan 2024 12:54:09 GMT
- Title: A Contrastive Learning Scheme with Transformer Innate Patches
- Authors: Sander Riisøen Jyhne, Per-Arne Andersen, Morten Goodwin
- Abstract summary: We present Contrastive Transformer, a contrastive learning scheme using the Transformer innate patches.
The scheme performs supervised patch-level contrastive learning, selecting the patches based on the ground truth mask.
The scheme applies to all vision-transformer architectures, is easy to implement, and introduces minimal additional memory footprint.
- Score: 4.588028371034407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents Contrastive Transformer, a contrastive learning scheme
using the Transformer innate patches. Contrastive Transformer enables existing
contrastive learning techniques, often used for image classification, to
benefit dense downstream prediction tasks such as semantic segmentation. The
scheme performs supervised patch-level contrastive learning, selecting the
patches based on the ground truth mask, subsequently used for hard-negative and
hard-positive sampling. The scheme applies to all vision-transformer
architectures, is easy to implement, and introduces minimal additional memory
footprint. Additionally, the scheme removes the need for huge batch sizes, as
each patch is treated as an image.
We apply and test Contrastive Transformer for the case of aerial image
segmentation, known for low-resolution data, large class imbalance, and similar
semantic classes. We perform extensive experiments to show the efficacy of the
Contrastive Transformer scheme on the ISPRS Potsdam aerial image segmentation
dataset. Additionally, we show the generalizability of our scheme by applying
it to multiple inherently different Transformer architectures. Ultimately, the
results show a consistent increase in mean IoU across all classes.
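The scheme lends itself to a compact implementation. Below is a minimal PyTorch sketch of supervised patch-level contrastive learning as described in the abstract: each Transformer patch token is labelled by the majority class of its ground-truth mask region, treated as its own sample, and contrasted against hard positives (the least similar same-class patches) and hard negatives (the most similar other-class patches). The function names, the majority-vote labelling rule, and the top-k mining strategy are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of supervised patch-level contrastive learning with hard
# positive/negative mining. Illustrative only; names and the exact labelling
# and mining rules are assumptions, not the paper's reference code.
import torch
import torch.nn.functional as F


def patch_labels(mask: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Label each patch with the majority class of its ground-truth mask region.

    mask: (B, H, W) integer class map -> returns (B, N) patch labels,
    with N = (H // patch_size) * (W // patch_size).
    """
    b, _, _ = mask.shape
    patches = mask.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    return patches.reshape(b, -1, patch_size * patch_size).mode(dim=-1).values


def contrastive_patch_loss(tokens, labels, temperature=0.1, k_hard=16):
    """Supervised contrastive loss over patch tokens.

    tokens: (B, N, D) patch embeddings, labels: (B, N) patch classes.
    Each patch acts as its own sample: positives share its class, negatives do not.
    """
    z = F.normalize(tokens.flatten(0, 1), dim=-1)     # (B*N, D) unit-norm embeddings
    y = labels.flatten()                              # (B*N,)
    sim = z @ z.t() / temperature                     # scaled cosine similarities
    same_class = y.unsqueeze(0) == y.unsqueeze(1)
    neg_mask = ~same_class                            # different class -> negative
    pos_mask = same_class.clone()
    pos_mask.fill_diagonal_(False)                    # a patch is not its own positive

    loss, count = tokens.new_zeros(()), 0
    for i in range(z.shape[0]):
        pos, neg = sim[i][pos_mask[i]], sim[i][neg_mask[i]]
        if pos.numel() == 0 or neg.numel() == 0:
            continue
        # hard mining: least-similar positives, most-similar negatives
        pos = pos.topk(min(k_hard, pos.numel()), largest=False).values
        neg = neg.topk(min(k_hard, neg.numel()), largest=True).values
        # InfoNCE-style term: pull each hard positive above the hard negatives
        loss = loss + (torch.logsumexp(torch.cat([pos, neg]), dim=0) - pos).mean()
        count += 1
    return loss / max(count, 1)


# Example with hypothetical shapes: 16x16 patches over 256x256 masks -> N = 256
tokens = torch.randn(2, 256, 128)                     # (B, N, D) from a ViT encoder
labels = patch_labels(torch.randint(0, 6, (2, 256, 256)), patch_size=16)
print(contrastive_patch_loss(tokens, labels))
```

Because every patch contributes as an independent sample, positives and negatives are drawn from within a single batch of images, which is why the scheme does not rely on the very large batch sizes common in image-level contrastive learning.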
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z) - Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyperparameter tuning.
We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z) - Adaptive Image Transformations for Transfer-based Adversarial Attack [73.74904401540743]
We propose a novel architecture, called Adaptive Image Transformation Learner (AITL).
Our elaborately designed learner adaptively selects the most effective combination of image transformations specific to the input image.
Our method significantly improves the attack success rates on both normally trained models and defense models under various settings.
arXiv Detail & Related papers (2021-11-27T08:15:44Z) - PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion [37.993611194758195]
We propose a Patch Pyramid Transformer (PPT) to address the issue of extracting semantic information from an image.
The experimental results demonstrate its superior performance against the state-of-the-art fusion approaches.
arXiv Detail & Related papers (2021-07-29T13:57:45Z) - XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens and allows efficient processing of high-resolution images; a minimal sketch of this channel-wise attention pattern appears after this list.
arXiv Detail & Related papers (2021-06-17T17:33:35Z) - Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
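For the XCiT entry above, the sketch below illustrates the transposed, channel-wise attention pattern it describes: queries and keys are L2-normalised along the token axis and attention is computed between feature channels, so the attention map has shape d x d rather than N x N and the cost grows linearly with the number of tokens. The module layout and parameter names are assumptions for illustration, not the official XCiT code.

```python
# Minimal sketch of cross-covariance ("transposed") attention: attention runs
# across feature channels, so its map is d_head x d_head and is independent of
# the token count N. Illustrative only, not the official XCiT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable per-head temperature, as described for XCA
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                                    # x: (B, N, D) patch tokens
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)                 # each: (B, heads, d_head, N)
        q = F.normalize(q, dim=-1)                           # L2-normalise along tokens
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d_head, d_head)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                       # (B, heads, d_head, N)
        return self.proj(out.permute(0, 3, 1, 2).reshape(b, n, d))


# The channel attention map stays d_head x d_head no matter how many tokens
# come in, which is what makes the cost linear in N.
tokens = torch.randn(2, 196, 256)
print(CrossCovarianceAttention(dim=256)(tokens).shape)  # torch.Size([2, 196, 256])
```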