Rethinking Global Context in Crowd Counting
- URL: http://arxiv.org/abs/2105.10926v2
- Date: Sat, 25 Nov 2023 18:07:15 GMT
- Title: Rethinking Global Context in Crowd Counting
- Authors: Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic,
Luc Van Gool
- Abstract summary: A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
- Score: 70.54184500538338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the role of global context for crowd counting.
Specifically, a pure transformer is used to extract features with global
information from overlapping image patches. Inspired by classification, we add
a context token to the input sequence, to facilitate information exchange with
tokens corresponding to image patches throughout transformer layers. Because
transformers do not explicitly model the tried-and-true channel-wise
interactions, we propose a token-attention module (TAM) to recalibrate encoded
features through channel-wise attention informed by the context token. Beyond
that, the context token is also used to predict the total person count of the
image through a regression-token module (RTM). Extensive experiments on various datasets,
including ShanghaiTech, UCF-QNRF, JHU-CROWD++ and NWPU, demonstrate that the
proposed context extraction techniques can significantly improve the
performance over the baselines.
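The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch size, stride, embedding width, the projection-free attention layer, the squeeze-and-excitation-style gate used for the token-attention step, and the linear count head are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def overlapping_patches(image, patch=4, stride=2):
    """Split an (H, W, C) image into flattened patches; stride < patch gives overlap."""
    h, w, _ = image.shape
    return np.stack([
        image[r:r + patch, c:c + patch].reshape(-1)
        for r in range(0, h - patch + 1, stride)
        for c in range(0, w - patch + 1, stride)
    ])

def self_attention(tokens):
    """Toy single-head self-attention without learned projections (assumption)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return tokens + weights @ tokens  # residual connection

def token_attention(patch_tokens, context, w1, w2):
    """Channel-wise gate computed from the context token (assumed SE-style MLP)."""
    hidden = np.maximum(context @ w1, 0.0)        # ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid, one value per channel
    return patch_tokens * gate, gate

def regress_count(context, w, b):
    """Linear head mapping the context token to one scalar (assumed form)."""
    return float(context @ w + b)

dim = 16
image = rng.random((8, 8, 3))
patches = overlapping_patches(image)                       # 3x3 grid of 4x4x3 patches
embed = rng.standard_normal((patches.shape[1], dim)) * 0.1
context = rng.standard_normal((1, dim)) * 0.02             # stand-in for a learned token

# Prepend the context token so it can exchange information with patch tokens.
tokens = self_attention(np.concatenate([context, patches @ embed], axis=0))
context_out, patch_out = tokens[0], tokens[1:]

w1 = rng.standard_normal((dim, dim // 4)) * 0.1
w2 = rng.standard_normal((dim // 4, dim)) * 0.1
recalibrated, gate = token_attention(patch_out, context_out, w1, w2)
count = regress_count(context_out, rng.standard_normal(dim) * 0.1, 0.0)
```

With random weights the value of `count` is meaningless; the sketch only shows how a single extra token can both gate the channels of the patch features and feed a scalar regression head.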
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer [29.95553680263075]
We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
arXiv Detail & Related papers (2023-10-20T15:54:18Z)
- Locality-Aware Generalizable Implicit Neural Representation [54.93702310461174]
Generalizable implicit neural representation (INR) enables a single continuous function to represent multiple data instances.
We propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder.
Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks.
arXiv Detail & Related papers (2023-10-09T11:26:58Z)
- RFR-WWANet: Weighted Window Attention-Based Recovery Feature Resolution Network for Unsupervised Image Registration [7.446209993071451]
The Swin transformer has attracted attention in medical image analysis due to its computational efficiency and long-range modeling capability.
The registration models based on transformers combine multiple voxels into a single semantic token.
This merging process limits the transformers to modeling and generating only coarse-grained spatial information.
We propose Recovery Feature Resolution Network (RFRNet), which allows the transformer to contribute fine-grained spatial information.
arXiv Detail & Related papers (2023-05-07T09:57:29Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- SUMD: Super U-shaped Matrix Decomposition Convolutional neural network for Image denoising [0.0]
We introduce the matrix decomposition (MD) module in the network to establish the global context feature.
Inspired by the design of multi-stage progressive restoration of U-shaped architecture, we further integrate the MD module into the multi-branches.
Our model (SUMD) can produce visual quality and accuracy comparable to Transformer-based methods.
arXiv Detail & Related papers (2022-04-11T04:38:34Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- CPTR: Full Transformer Network for Image Captioning [15.869556479220984]
CaPtion TransformeR (CPTR) takes the sequentialized raw images as the input to Transformer.
Compared to the "CNN+Transformer" design paradigm, our model can model global context at every encoder layer from the beginning.
arXiv Detail & Related papers (2021-01-26T14:29:52Z)
- Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network [96.4761273757796]
We introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation.
GET adaptively guides the decoder to generate high-quality captions.
arXiv Detail & Related papers (2020-12-13T13:38:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.