Related papers: ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers

ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers

URL: http://arxiv.org/abs/2409.07541v1
Date: Wed, 11 Sep 2024 18:03:59 GMT
Title: ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers
Authors: Giorgos Savathrakis, Antonis Argyros,
Abstract summary: Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. We propose to cluster the transformer input on the basis of its entropy. Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage.
Score: 0.0
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason for this is that the self-information of each pixel (whose sum is the entropy), is likely to be similar among pixels corresponding to the same objects. Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage, while at the same time preserves meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, that can be plugged-in any transformer architecture that consists of a multi-head self-attention computation in its encoder. We ran extensive experiments using the COCO object detection dataset, and three detection transformers. The obtained results demonstrate that in all tested cases, there is consistent reduction in the required computational resources, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at https://github.com/GSavathrakis/ENACT

Related papers

Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model [10.473819332984005]
We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention. The proposed model achieves $6-22%$ higher ROUGE1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
arXiv Detail & Related papers (2023-05-24T03:47:22Z)
Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers [34.42710399235461]
Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. They suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders. We propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features.
arXiv Detail & Related papers (2023-03-26T20:50:58Z)
Transformers for Object Detection in Large Point Clouds [9.287964414592826]
We present TransLPC, a novel detection model for large point clouds based on a transformer architecture. We propose a novel query refinement technique to improve detection accuracy, while retaining a memory-friendly number of transformer decoder queries. This simple technique has a significant effect on detection accuracy, which is evaluated on the challenging nuScenes dataset on real-world lidar data.
arXiv Detail & Related papers (2022-09-30T06:35:43Z)
ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We set transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation. We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration. The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches. We propose a decoder-free fully transformer-based (DFFT) object detector. DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector. We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection. vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution vision tasks with transformers, it directly translates the image feature map into the object result. Recent transformer-based image recognition model andTT show consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer. Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image. Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
End-to-End Object Detection with Adaptive Clustering Transformer [37.9114488933667]
A novel variant of Transformer named Adaptive Clustering Transformer(ACT) has been proposed to reduce the computation cost for high-resolution input. ACT cluster the query features adaptively using Locality Sensitive Hashing (LSH) and ap-proximate the query-key interaction. Code is released as supplementary for the ease of experiment replication and verification.
arXiv Detail & Related papers (2020-11-18T14:36:37Z)
End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
Algorithm-hardware Co-design for Deformable Convolution [40.50544352625659]
We build an efficient object detection network with modified deformable convolutions and quantize the network using state-of-the-art quantization methods. Preliminary experiments show that little accuracy is compromised and speedup can be achieved with our co-design optimization for the deformable convolution.
arXiv Detail & Related papers (2020-02-19T01:08:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.