Related papers: TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting

TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting

URL: http://arxiv.org/abs/2512.22203v1
Date: Sun, 21 Dec 2025 10:37:00 GMT
Title: TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting
Authors: Qiang Guo, Rubo Zhang, Bingbing Zhang, Junjie Liu, Jianqing Liu,
Abstract summary: TCFormer is a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance.<n>To compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module.<n>This paper introduces a density-level classification loss, which discretizes crowd density into distinct grades.
Score: 13.816243638358408
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes the TCFormer, a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. Firstly, a powerful yet efficient vision transformer is adopted as the feature extractor, the global context-aware capabilities of which provides semantic meaningful crowd features with a minimal memory footprint. Secondly, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model's classification power across varying levels of crowd density. Therefore, although TCformer is trained under a weakly-supervised paradigm utilizing only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy. Extensive experiments on four benchmarks including ShanghaiTech A/B, UCF-QNRF, and NWPU datasets demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and can be a good solution for crowd counting tasks in edge devices.

Related papers

Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes.<n>In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID.<n>We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z)
Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification [3.6907522136316975]
Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking.<n>We explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps.<n>We propose a novel PEFT strategy termed Domain Representation Injection (DRI)
arXiv Detail & Related papers (2025-12-24T02:30:23Z)
CEM-FBGTinyDet: Context-Enhanced Foreground Balance with Gradient Tuning for tiny Objects [2.321156185872456]
We propose E-FPN-BS, a novel architecture integrating multi-scale feature enhancement and adaptive optimization.<n>First, our Context Enhancement Module(CEM) employs dual-branch processing to align and compress high-level features for effective global-local fusion.<n>Second, the Foreground-Background Separation Module (FBSM) generates spatial gating masks that dynamically amplify discriminative regions.
arXiv Detail & Related papers (2025-06-11T16:13:38Z)
Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable.<n>We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment.<n>Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification [42.14439854721613]
We propose variational autoencoders (VAEs) for disentanglement to extract essential latent features that enable accurate classification of interferences.<n>Our proposed VAE achieves a data compression rate ranging from 512 to 8,192 and achieves an accuracy up to 99.92%.
arXiv Detail & Related papers (2025-04-14T13:38:00Z)
A feature refinement module for light-weight semantic segmentation network [11.285793559719702]
This paper proposes a novel semantic segmentation method to improve the capacity of obtaining semantic information for the light-weight network.<n>On Cityscapes and Bdd100K datasets, the experimental results demonstrate that the proposed method achieves a promising trade-off between accuracy and computational cost.
arXiv Detail & Related papers (2024-12-11T03:31:20Z)
HAFLQ: Heterogeneous Adaptive Federated LoRA Fine-tuned LLM with Quantization [55.972018549438964]
Federated fine-tuning of pre-trained Large Language Models (LLMs) enables task-specific adaptation across diverse datasets while preserving privacy.<n>We propose HAFLQ (Heterogeneous Adaptive Federated Low-Rank Adaptation Fine-tuned LLM with Quantization), a novel framework for efficient and scalable fine-tuning of LLMs in heterogeneous environments.<n> Experimental results on the text classification task demonstrate that HAFLQ reduces memory usage by 31%, lowers communication cost by 49%, improves accuracy by 50%, and achieves faster convergence compared to the baseline method.
arXiv Detail & Related papers (2024-11-10T19:59:54Z)
Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation [51.66997548477913]
We propose a novel feature-level consistency learning framework named Density-Descending Feature Perturbation (DDFP) Inspired by the low-density separation assumption in semi-supervised learning, our key insight is that feature density can shed a light on the most promising direction for the segmentation classifier to explore. The proposed DDFP outperforms other designs on feature-level perturbations and shows state of the art performances on both Pascal VOC and Cityscapes dataset.
arXiv Detail & Related papers (2024-03-11T06:59:05Z)
Semi-supervised Crowd Counting via Density Agency [57.3635501421658]
We build a learnable auxiliary structure, namely the density agency to bring the recognized foreground regional features close to corresponding density sub-classes. Second, we propose a density-guided contrastive learning loss to consolidate the backbone feature extractor. Third, we build a regression head by using a transformer structure to refine the foreground features further.
arXiv Detail & Related papers (2022-09-07T06:34:00Z)
Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF) We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results. In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
PSCNet: Pyramidal Scale and Global Context Guided Network for Crowd Counting [44.306790250158954]
This paper proposes a novel crowd counting approach based on pyramidal scale module (PSM) and global context module (GCM) PSM is used to adaptively capture multi-scale information, which can identify a fine boundary of crowds with different image scales. GCM is devised with low-complexity and lightweight manner, to make the interactive information across the channels of the feature maps more efficient.
arXiv Detail & Related papers (2020-12-07T11:35:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.