FusionCount: Efficient Crowd Counting via Multiscale Feature Fusion
- URL: http://arxiv.org/abs/2202.13660v1
- Date: Mon, 28 Feb 2022 10:04:07 GMT
- Title: FusionCount: Efficient Crowd Counting via Multiscale Feature Fusion
- Authors: Yiming Ma, Victor Sanchez and Tanaya Guha
- Abstract summary: This paper proposes a novel crowd counting architecture (FusionCount).
It exploits the adaptive fusion of a large majority of encoded features instead of relying on additional extraction components to obtain multiscale features.
Experiments on two benchmark databases demonstrate that our model achieves state-of-the-art results with reduced computational complexity.
- Score: 36.15554768378944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art crowd counting models follow an encoder-decoder approach.
Images are first processed by the encoder to extract features. Then, to account
for perspective distortion, the highest-level feature map is fed to extra
components to extract multiscale features, which are the input to the decoder
to generate crowd densities. However, in these methods, features extracted at
earlier stages during encoding are underutilised, and the multiscale modules
can only capture a limited range of receptive fields, albeit with considerable
computational cost. This paper proposes a novel crowd counting architecture
(FusionCount), which exploits the adaptive fusion of a large majority of
encoded features instead of relying on additional extraction components to
obtain multiscale features. Thus, it can cover a more extensive scope of
receptive field sizes and lower the computational cost. We also introduce a new
channel reduction block, which can extract saliency information during decoding
and further enhance the model's performance. Experiments on two benchmark
databases demonstrate that our model achieves state-of-the-art results with
reduced computational complexity.
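To make the fusion idea concrete, here is a minimal NumPy sketch of combining a shallow and a deep encoder feature map: the deeper map is channel-reduced with a 1x1 convolution, upsampled to the shallower map's resolution, and added element-wise. All shapes, weights, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    # 1x1 convolution expressed as a channel-mixing matmul; w is (C_out, C_in).
    c, h, width = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, width)

def fuse(shallow, deep, w_reduce):
    # Channel-reduce the deeper map, upsample it to the shallow map's
    # resolution, then combine by element-wise addition.
    deep = conv1x1(deep, w_reduce)
    deep = upsample2x(deep)
    return shallow + deep

rng = np.random.default_rng(0)
f_shallow = rng.standard_normal((64, 32, 32))   # earlier encoder stage
f_deep = rng.standard_normal((128, 16, 16))     # later encoder stage
w = rng.standard_normal((64, 128)) * 0.01       # 1x1 channel-reduction weights

fused = fuse(f_shallow, f_deep, w)
print(fused.shape)  # (64, 32, 32)
```

In the actual model the reduction weights are learned and the fusion is applied adaptively across most encoder stages; this sketch only shows the shape bookkeeping of a single fusion step.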
Related papers
- Optimizing Medical Image Segmentation with Advanced Decoder Design [0.8402155549849591]
U-Net is widely used in medical image segmentation due to its simple and flexible architecture design.
We propose Swin DER (i.e., Swin UNETR Decoder Enhanced and Refined) by specifically optimizing the design of these three components.
Our model design achieves excellent results, surpassing other state-of-the-art methods on both the Synapse and the MSD brain tumor segmentation task.
arXiv Detail & Related papers (2024-10-05T11:47:13Z) - Few-Shot Medical Image Segmentation with Large Kernel Attention [5.630842216128902]
We propose a few-shot medical segmentation model that acquires comprehensive feature representation capabilities.
Our model comprises four key modules: a dual-path feature extractor, an attention module, an adaptive prototype prediction module, and a multi-scale prediction fusion module.
The results demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-07-27T02:28:30Z) - DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut [62.63481844384229]
Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks.
In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method.
Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks.
arXiv Detail & Related papers (2024-06-05T01:32:31Z) - Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z) - Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
arXiv Detail & Related papers (2024-03-19T19:27:23Z) - More complex encoder is not all you need [0.882348769487259]
We introduce neU-Net (i.e., not complex encoder U-Net), which incorporates a novel Sub-pixel Convolution for upsampling to construct a powerful decoder.
Our model design achieves excellent results, surpassing other state-of-the-art methods on both the Synapse and ACDC datasets.
arXiv Detail & Related papers (2023-09-20T08:34:38Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in segmentation quality and efficiency per layer.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
To further improve the architecture, we introduce a weighting function that re-balances classes, increasing the networks' attention to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z) - General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference [34.47592026375839]
We show that some of the computational cost during inference can be amortized over the different tasks using a shared text encoder.
We also compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks.
arXiv Detail & Related papers (2020-04-29T16:11:26Z) - Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting [6.893512627479196]
We propose two modified neural networks for accurate and efficient crowd counting.
The first model is named M-SFANet, which augments SFANet with atrous spatial pyramid pooling (ASPP) and a context-aware module (CAN).
The second model is called M-SegNet, which is produced by replacing the bilinear upsampling in SFANet with max unpooling that is used in SegNet.
arXiv Detail & Related papers (2020-03-12T03:00:26Z)
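The distinction drawn in the last entry, max unpooling (SegNet) versus bilinear upsampling (SFANet), can be sketched in NumPy on a toy single-channel map; this is an illustrative reconstruction, not the authors' implementation. Max unpooling restores each pooled value to the exact position it came from and leaves every other position zero, whereas bilinear upsampling fills all output positions with interpolated values.

```python
import numpy as np

def maxpool2x_with_indices(x):
    # 2x2 max pooling over an (H, W) map, recording each block's argmax
    # so the positions can be reused for unpooling (as in SegNet).
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = blocks.argmax(1)
    return blocks.max(1).reshape(h // 2, w // 2), idx

def max_unpool2x(pooled, idx):
    # Place each pooled value back at its recorded position; all other
    # positions stay zero (unlike bilinear upsampling, which fills
    # every position with interpolated values).
    ph, pw = pooled.shape
    out_blocks = np.zeros((ph * pw, 4))
    out_blocks[np.arange(ph * pw), idx] = pooled.ravel()
    return out_blocks.reshape(ph, pw, 2, 2).transpose(0, 2, 1, 3).reshape(ph * 2, pw * 2)

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 feature map
p, idx = maxpool2x_with_indices(x)
y = max_unpool2x(p, idx)
print(p)  # [[ 5.  7.] [13. 15.]]
```

On this 4x4 ramp every block's maximum sits in its bottom-right corner, so the unpooled map is non-zero only at those four recorded positions and preserves the pooled values exactly.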
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.