Related papers: CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

URL: http://arxiv.org/abs/2510.23785v1
Date: Mon, 27 Oct 2025 19:16:02 GMT
Title: CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
Authors: Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen,
Abstract summary: Humans can effortlessly count diverse objects by perceiving visual repetition and structural relationships rather than relying on class identity. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans can effortlessly count diverse objects by perceiving visual repetition and structural relationships rather than relying on class identity. However, most existing counting models fail to replicate this ability; they often miscount when objects exhibit complex shapes, internal symmetry, or overlapping components. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting. Built upon the CounTR architecture, our model replaces its visual encoder with the self-supervised foundation model DINOv2, which produces richer and spatially consistent feature representations. We further incorporate positional embedding fusion to preserve geometric relationships before decoding these features into density maps through a lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model achieves performance comparable to current state-of-the-art methods while demonstrating superior accuracy on structurally intricate or densely packed scenes. Our findings indicate that integrating foundation models such as DINOv2 enables counting systems to approach human-like structural perception, advancing toward a truly general and exemplar-free counting paradigm.
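
The abstract does not spell out how a density map becomes a count or how accuracy is scored. As a minimal sketch, assuming the usual convention for density-map counters (the predicted count is the sum of the map) and assuming the MAE and RMSE metrics commonly reported on FSC-147, the readout could look like the following. The function names and the toy values are hypothetical and are not taken from the paper.

```python
import math

def count_from_density(density_map):
    """Return the predicted count as the sum of all density values."""
    return sum(sum(row) for row in density_map)

def mae_rmse(predicted, actual):
    """Return (mean absolute error, root mean squared error) over a set of images."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Toy 4x4 density map (illustrative values only): two objects, each
# spread over a few pixels so that its density mass sums to 1.
toy_density = [
    [0.0, 0.25, 0.25, 0.0],
    [0.0, 0.25, 0.25, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
]

predicted_count = count_from_density(toy_density)

# Hypothetical predictions and ground-truth counts for two images.
mae, rmse = mae_rmse([predicted_count, 5.0], [3, 5])
```

Because the count is a sum rather than a detection tally, fractional density values are expected, and overlapping objects can still contribute to the total.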

Related papers

Communication-Inspired Tokenization for Structured Image Representations [74.17163003465537]
COMmunication inspired Tokenization (COMiT) is a framework for learning structured discrete visual token sequences. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure.
arXiv Detail & Related papers (2026-02-24T09:53:50Z)
Factuality Matters: When Image Generation and Editing Meet Structured Visuals [46.627460447235855]
We construct a large-scale dataset of 1.3 million high-quality structured image pairs. We train a unified model that integrates a VLM with FLUX.1 Kontext. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation.
arXiv Detail & Related papers (2025-10-06T17:56:55Z)
Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
Detection Based Part-level Articulated Object Reconstruction from Single RGBD Image [52.11275397911693]
We propose an end-to-end trainable, cross-category method for reconstructing multiple man-made articulated objects from a single RGBD image. We depart from previous works that rely on learning instance-level latent space, focusing on man-made articulated objects with predefined part counts. Our method successfully reconstructs variously structured multiple instances that previous works cannot handle, and outperforms prior works in shape reconstruction and kinematics estimation.
arXiv Detail & Related papers (2025-04-04T05:08:04Z)
The Effectiveness of a Simplified Model Structure for Crowd Counting [11.640020969258101]
This paper discusses how to construct high-performance crowd counting models using only simple structures. We propose the Fuss-Free Network (FFNet), which is characterized by its simple and efficient structure, consisting of only a backbone network and a multi-scale feature fusion structure. Our proposed crowd counting model is trained and evaluated on four widely used public datasets, and it achieves accuracy that is comparable to that of existing complex models.
arXiv Detail & Related papers (2024-04-11T15:42:53Z)
CounTR: Transformer-based Generalised Visual Counting [94.54725247039441]
We develop a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars". We conduct thorough ablation studies on the large-scale counting benchmark, e.g. FSC-147, and demonstrate state-of-the-art performance on both zero and few-shot settings.
arXiv Detail & Related papers (2022-08-29T17:02:45Z)
Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder. We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets. We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
SetVAE: Learning Hierarchical Composition for Generative Modeling of Set-Structured Data [27.274328701618]
We propose SetVAE, a hierarchical variational autoencoder for sets. Motivated by recent progress in set encoding, we build SetVAE upon attentive modules that first partition the set and project the partition back to the original cardinality. We demonstrate that our model generalizes to unseen set sizes and learns interesting subset relations without supervision.
arXiv Detail & Related papers (2021-03-29T14:01:18Z)
Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions. We show the recognition backbone can be substantially enhanced for more robust representation learning. Our approach achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.