Augmenting Convolutional networks with attention-based aggregation
- URL: http://arxiv.org/abs/2112.13692v1
- Date: Mon, 27 Dec 2021 14:05:41 GMT
- Title: Augmenting Convolutional networks with attention-based aggregation
- Authors: Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski,
Armand Joulin, Gabriel Synnaeve, Hervé Jégou
- Abstract summary: We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning.
We plug this learned aggregation layer with a simplistic patch-based convolutional network parametrized by 2 parameters (width and depth).
It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
- Score: 55.97184767391253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show how to augment any convolutional network with an attention-based
global map to achieve non-local reasoning. We replace the final average pooling
by an attention-based aggregation layer akin to a single transformer block,
that weights how the patches are involved in the classification decision. We
plug this learned aggregation layer with a simplistic patch-based convolutional
network parametrized by 2 parameters (width and depth). In contrast with a
pyramidal design, this architecture family maintains the input patch resolution
across all the layers. It yields surprisingly competitive trade-offs between
accuracy and complexity, in particular in terms of memory consumption, as shown
by our experiments on various computer vision tasks: object classification,
image segmentation and detection.
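The idea of replacing average pooling with an attention-based aggregation can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single query vector (a stand-in for a learned class token), a single head, and no key/value projections or feed-forward layer, and the function names `attention_pool` and `average_pool` are hypothetical. It only shows how softmax weights decide how much each patch contributes to the pooled vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(patches):
    """Baseline: every patch contributes equally."""
    n = len(patches)
    return [sum(column) / n for column in zip(*patches)]


def attention_pool(patches, query):
    """Weight each patch by softmax(query . patch / sqrt(d)).

    `query` stands in for a learned class token (an assumption of this
    sketch); a real layer would also apply learned projections.
    """
    d = len(query)
    scores = [
        sum(q * p for q, p in zip(query, patch)) / math.sqrt(d)
        for patch in patches
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * patch[i] for w, patch in zip(weights, patches))
        for i in range(d)
    ]
    return pooled, weights


# Three toy patch embeddings of dimension 2.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(patches, query=[4.0, 0.0])
print("attention weights:", weights)
print("attention-pooled vector:", pooled)
print("average-pooled vector:", average_pool(patches))
```

With an all-zero query every score is equal, so the weights become uniform and the result coincides with average pooling; a non-zero query shifts weight toward the patches it aligns with, which is what makes the per-patch weights readable as a map of which patches drive the decision.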
Related papers
- GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation [33.72549134362884]
We propose GSTran, a novel transformer network tailored for the segmentation task.
The proposed network mainly consists of two principal components: a local geometric transformer and a global semantic transformer.
Experiments on ShapeNetPart and S3DIS benchmarks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-08-21T12:12:37Z)
- Mesh Denoising Transformer [104.5404564075393]
Mesh denoising is aimed at removing noise from input meshes while preserving their feature structures.
SurfaceFormer is a pioneering Transformer-based mesh denoising framework.
A new representation, the Local Surface Descriptor, captures local geometric intricacies.
A Denoising Transformer module receives the multimodal information and achieves efficient global feature aggregation.
arXiv Detail & Related papers (2024-05-10T15:27:43Z)
- Integrative Feature and Cost Aggregation with Transformers for Dense Correspondence [63.868905184847954]
The current state of the art consists of Transformer-based approaches that focus on either feature descriptors or cost volume aggregation.
We propose a novel Transformer-based network that interleaves both forms of aggregations in a way that exploits their complementary information.
We evaluate the effectiveness of the proposed method on dense matching tasks and achieve state-of-the-art performance on all the major benchmarks.
arXiv Detail & Related papers (2022-09-19T03:33:35Z)
- Occlusion-Aware Instance Segmentation via BiLayer Network Architectures [73.45922226843435]
We propose Bilayer Convolutional Network (BCNet), where the top layer detects occluding objects (occluders) and the bottom layer infers partially occluded instances (occludees).
We investigate the efficacy of the bilayer structure using two popular convolutional network designs, namely, Fully Convolutional Network (FCN) and Graph Convolutional Network (GCN).
arXiv Detail & Related papers (2022-08-08T21:39:26Z)
- Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers [5.177947445379688]
We propose a new segmentation model that combines convolutional neural networks with deep transformers.
Our results demonstrate that the proposed methodology improves segmentation accuracy compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-06-20T12:03:54Z)
- DenseGAP: Graph-Structured Dense Correspondence Learning with Anchor Points [15.953570826460869]
Establishing dense correspondence between two images is a fundamental computer vision problem.
We introduce DenseGAP, a new solution for efficient Dense correspondence learning with a Graph-structured neural network conditioned on Anchor Points.
Our method advances the state-of-the-art of correspondence learning on most benchmarks.
arXiv Detail & Related papers (2021-12-13T18:59:30Z)
- Auto-Parsing Network for Image Captioning and Visual Question Answering [101.77688388554097]
We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures.
Specifically, we impose a Probabilistic Graphical Model (PGM) parameterized by the attention operations on each self-attention layer to incorporate a sparse assumption.
arXiv Detail & Related papers (2021-08-24T08:14:35Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT: a ViT with token Pooling and attention Sharing to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
- GAttANet: Global attention agreement for convolutional neural networks [0.0]
Transformer attention architectures, similar to those developed for natural language processing, have recently proved efficient also in vision.
Here, we report experiments with a simple attention system of this kind that can improve the performance of standard convolutional networks.
We demonstrate the usefulness of this brain-inspired Global Attention Agreement network for various convolutional backbones.
arXiv Detail & Related papers (2021-04-12T15:45:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.