Volumetric Transformer Networks
- URL: http://arxiv.org/abs/2007.09433v1
- Date: Sat, 18 Jul 2020 14:00:12 GMT
- Title: Volumetric Transformer Networks
- Authors: Seungryong Kim, Sabine Süsstrunk, Mathieu Salzmann
- Abstract summary: We introduce a learnable module, the volumetric transformer network (VTN).
VTN predicts channel-wise warping fields so as to reconfigure intermediate CNN features both spatially and across channels.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
- Score: 88.85542905676712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing techniques to encode spatial invariance within deep convolutional
neural networks (CNNs) apply the same warping field to all the feature
channels. This does not account for the fact that the individual feature
channels can represent different semantic parts, which can undergo different
spatial transformations w.r.t. a canonical configuration. To overcome this
limitation, we introduce a learnable module, the volumetric transformer network
(VTN), that predicts channel-wise warping fields so as to reconfigure
intermediate CNN features both spatially and across channels. We design our VTN as an
encoder-decoder network, with modules dedicated to letting the information flow
across the feature channels, to account for the dependencies between the
semantic parts. We further propose a loss function defined between the warped
features of pairs of instances, which improves the localization ability of VTN.
Our experiments show that VTN consistently boosts the features' representation
power and consequently the networks' accuracy on fine-grained image recognition
and instance-level image retrieval.
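To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea: a module that predicts a separate warping field for every feature channel and resamples each channel with its own field. The flow-head design and layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of per-channel warping of an intermediate CNN feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseWarp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Predict a 2-D offset (dx, dy) for every channel at every location:
        # 2 * C output channels, interpreted as C separate flow fields.
        # (Assumed head; the paper uses an encoder-decoder with cross-channel modules.)
        self.flow = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, 3, padding=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        offsets = self.flow(x).view(b, c, 2, h, w)  # per-channel flow, normalized units
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)                  # (H, W, 2)
        grid = base + offsets.permute(0, 1, 3, 4, 2)          # (B, C, H, W, 2)
        # Warp each channel with its own field by folding C into the batch.
        warped = F.grid_sample(
            x.reshape(b * c, 1, h, w),
            grid.reshape(b * c, h, w, 2),
            align_corners=True,
        )
        return warped.view(b, c, h, w)
```

Folding the channel dimension into the batch lets a single grid_sample call warp all channels with distinct fields; the paper's pairwise loss between warped features of two instances could then be, for example, a distance between the two warped tensors.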
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
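As a hedged illustration of the "TC" stream's input representation, the sketch below converts a 1-D behavioral signal into a 2-D scale-by-time tensor with PyWavelets; the Morlet wavelet and the scale range are assumptions, not taken from the paper.

```python
# Minimal sketch: Continuous Wavelet Transform of a 1-D signal into a
# 2-D (scale x time) tensor that a CNN can consume.
import numpy as np
import pywt

def signal_to_scalogram(signal: np.ndarray, num_scales: int = 64) -> np.ndarray:
    """Return a (num_scales, len(signal)) time-frequency representation."""
    scales = np.arange(1, num_scales + 1)      # assumed scale range
    coeffs, _freqs = pywt.cwt(signal, scales, "morl")
    return np.abs(coeffs)                      # magnitude scalogram

# Example: a 4-second, 5 Hz sinusoid sampled at 128 Hz.
x = np.sin(2 * np.pi * 5 * np.linspace(0, 4, 512))
scalogram = signal_to_scalogram(x)             # shape (64, 512)
```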
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- TBSN: Transformer-Based Blind-Spot Network for Self-Supervised Image Denoising [94.09442506816724]
Blind-spot networks (BSN) have been prevalent network architectures in self-supervised image denoising (SSID).
We present a transformer-based blind-spot network (TBSN) by analyzing and redesigning the transformer operators that meet the blind-spot requirement.
For spatial self-attention, an elaborate mask is applied to the attention matrix to restrict its receptive field, thus mimicking the dilated convolution.
For channel self-attention, we observe that it may leak blind-spot information when the number of channels is greater than the spatial size in the deep layers of multi-scale architectures.
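A minimal sketch of the masked spatial self-attention idea follows, assuming the simplest possible blind-spot mask (each position excluded from its own receptive field); TBSN's actual mask pattern is more elaborate.

```python
# Hedged sketch: mask the attention matrix so queries cannot attend to
# blind-spot positions, analogous to how dilated convolutions restrict
# the receptive field.
import torch
import torch.nn.functional as F

def masked_spatial_attention(q, k, v):
    """q, k, v: (B, N, D) token sequences over the H*W spatial positions."""
    b, n, d = q.shape
    attn = q @ k.transpose(-2, -1) / d ** 0.5         # (B, N, N) scores
    mask = torch.eye(n, dtype=torch.bool, device=q.device)
    attn = attn.masked_fill(mask, float("-inf"))      # blind spot: no self-lookup
    return F.softmax(attn, dim=-1) @ v
```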
arXiv Detail & Related papers (2024-04-11T15:39:10Z)
- FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
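The summary gives few architectural details, so the sketch below shows only one plausible reading of "reconciling features with multiple receptive fields adaptively": parallel dilated convolutions fused by per-location learned weights. This is purely an illustrative assumption, not FMRT's actual design.

```python
# Hypothetical sketch: adaptive per-pixel fusion of multi-receptive-field features.
import torch
import torch.nn as nn

class AdaptiveReceptiveFieldFusion(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        # One branch per receptive field, via increasing dilation.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        self.gate = nn.Conv2d(channels, len(dilations), 1)  # per-pixel branch weights

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, H, W)
        w = torch.softmax(self.gate(x), dim=1).unsqueeze(2)        # (B, K, 1, H, W)
        return (w * feats).sum(dim=1)                              # weighted blend
```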
arXiv Detail & Related papers (2023-10-20T15:54:18Z)
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
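As a rough sketch of the locality-aware decoding idea, the module below conditions each query coordinate on the encoder token nearest to it on a latent grid, rather than on a single global latent. The grid layout and MLP sizes are assumptions for illustration.

```python
# Hedged sketch: an INR decoder that looks up the spatially nearest
# transformer token for each query coordinate.
import torch
import torch.nn as nn

class LocalityAwareDecoder(nn.Module):
    def __init__(self, latent_dim=256, out_dim=3, grid=16):
        super().__init__()
        self.grid = grid
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, tokens, coords):
        """tokens: (B, grid*grid, D) from a transformer encoder;
        coords: (B, Q, 2) query coordinates in [0, 1]."""
        idx = (coords * (self.grid - 1)).round().long()   # nearest grid cell
        flat = idx[..., 1] * self.grid + idx[..., 0]      # (B, Q) flat index
        local = torch.gather(
            tokens, 1, flat.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )                                                 # (B, Q, D) local latents
        return self.mlp(torch.cat([local, coords], dim=-1))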
arXiv Detail & Related papers (2023-10-09T11:26:58Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level context memory [10.210120085157161]
This study proposes MemoryAdaptNet, a novel unsupervised domain adaptation semantic segmentation network for high-resolution remote sensing (HRS) imagery.
MemoryAdaptNet constructs an output-space adversarial learning scheme to bridge the domain distribution discrepancy between the source and target domains.
Experiments under three cross-domain tasks indicate that our proposed MemoryAdaptNet is remarkably superior to the state-of-the-art methods.
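Output-space adversarial learning is a standard UDA pattern; the sketch below shows its generic form (a discriminator on softmax segmentation maps, plus an adversarial term for the segmenter) under assumed layer sizes, not MemoryAdaptNet's exact components.

```python
# Hedged sketch of output-space adversarial alignment for segmentation UDA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputSpaceDiscriminator(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Assumed generic fully-convolutional discriminator.
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),  # source/target logits
        )

    def forward(self, seg_logits):
        return self.net(F.softmax(seg_logits, dim=1))   # operate on output space

def adversarial_loss(disc, target_seg_logits):
    # Train the segmenter so target predictions are indistinguishable from source.
    pred = disc(target_seg_logits)
    return F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))
```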
arXiv Detail & Related papers (2022-08-16T12:35:57Z)
- MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification [5.28889161958623]
We propose a new network architecture, based on the Time Delay Neural Network (TDNN), that aggregates channel and context interdependence features from multiple aspects.
The proposed MACCIF-TDNN architecture can outperform most of the state-of-the-art TDNN-based systems on VoxCeleb1 test sets.
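The abstract is light on specifics; one standard way to model channel interdependence over TDNN frame features is a squeeze-and-excitation gate, sketched below as an assumption rather than the paper's exact block.

```python
# Hedged sketch: squeeze-and-excitation style channel gating over
# frame-level (B, C, T) TDNN features.
import torch
import torch.nn as nn

class ChannelInterdependence(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):             # x: (B, C, T)
        s = x.mean(dim=2)             # squeeze: pool over time
        w = self.fc(s).unsqueeze(-1)  # excitation: per-channel gates
        return x * w                  # re-weight channels by interdependence
```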
arXiv Detail & Related papers (2021-07-07T09:43:42Z)
- Feature Flow: In-network Feature Flow Estimation for Video Object Detection [56.80974623192569]
Optical flow is widely used in computer vision tasks to provide pixel-level motion information.
A common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset.
We propose a novel network (IFF-Net) with an In-network Feature Flow estimation module for video object detection.
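A hedged sketch of the in-network feature-flow idea: estimate a flow field directly between two frames' feature maps and warp the reference features to the current frame. The tiny flow head here is an assumption; IFF-Net's module is more sophisticated.

```python
# Minimal sketch: in-network flow estimation between feature maps,
# followed by warping of the reference features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFlowWarp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Assumed single-conv flow head over concatenated features.
        self.flow = nn.Conv2d(2 * channels, 2, 3, padding=1)  # (dx, dy) per pixel

    def forward(self, feat_ref, feat_cur):
        b, c, h, w = feat_ref.shape
        flow = self.flow(torch.cat([feat_ref, feat_cur], dim=1))  # (B, 2, H, W)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=flow.device),
            torch.linspace(-1, 1, w, device=flow.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1) + flow.permute(0, 2, 3, 1)
        return F.grid_sample(feat_ref, grid, align_corners=True)  # warped features
```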
arXiv Detail & Related papers (2020-09-21T07:55:50Z)
- Rotation-Invariant Gait Identification with Quaternion Convolutional Neural Networks [7.638280076041963]
We introduce Quaternion CNN, a network architecture which is intrinsically layer-wise equivariant and globally invariant under 3D rotations.
We show empirically that this network indeed significantly outperforms a traditional CNN in a multi-user rotation-invariant gait classification setting.
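The rotation equivariance comes from the quaternion algebra itself; the sketch below shows the Hamilton product and the 3-D rotation it induces, which is the primitive a quaternion convolution builds on (the algebra only, not the full QCNN layer).

```python
# Sketch of the quaternion primitives underlying a quaternion CNN.
import numpy as np

def hamilton_product(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """p, q: quaternions as (w, x, y, z) arrays."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(v: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Rotate pure quaternion v = (0, x, y, z) by unit quaternion r: r * v * conj(r)."""
    r_conj = r * np.array([1, -1, -1, -1])
    return hamilton_product(hamilton_product(r, v), r_conj)
```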
arXiv Detail & Related papers (2020-08-04T23:22:12Z)