CounTR: Transformer-based Generalised Visual Counting
- URL: http://arxiv.org/abs/2208.13721v3
- Date: Fri, 2 Jun 2023 07:51:22 GMT
- Title: CounTR: Transformer-based Generalised Visual Counting
- Authors: Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
- Abstract summary: We develop a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars".
We conduct thorough ablation studies on the large-scale counting benchmark FSC-147 and demonstrate state-of-the-art performance in both the zero-shot and few-shot settings.
- Score: 94.54725247039441
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed the Counting Transformer (CounTR), which explicitly captures the similarity between image patches, or between patches and the given "exemplars", with an attention mechanism; (2) We adopt a two-stage training regime that first pre-trains the model with self-supervised learning and then fine-tunes it with supervision; (3) We propose a simple, scalable pipeline for synthesizing training images that contain a large number of instances or objects from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) We conduct thorough ablation studies on the large-scale counting benchmark FSC-147 and demonstrate state-of-the-art performance in both the zero-shot and few-shot settings.
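The architectural core in contribution (1) is attention that scores similarity both among image patches and between patches and the encoded "exemplars". Below is a minimal, hypothetical PyTorch sketch of that interaction; the function name, single-head formulation, and dimensions are illustrative assumptions rather than the authors' implementation:

```python
import torch

def similarity_attention(patch_tokens, exemplar_tokens):
    """Single-head attention from image patches to patches + exemplars.

    Hypothetical sketch, not CounTR's code.
    patch_tokens:    (N, D) -- one token per image patch
    exemplar_tokens: (K, D) -- K encoded exemplar crops (K may be 0
                               in the zero-shot setting)
    Returns updated patch tokens of shape (N, D).
    """
    # Keys/values come from both the image itself and the exemplars,
    # so the attention weights explicitly encode patch-patch and
    # patch-exemplar similarity.
    context = torch.cat([patch_tokens, exemplar_tokens], dim=0)  # (N+K, D)
    d = patch_tokens.shape[-1]
    scores = patch_tokens @ context.T / d ** 0.5                 # (N, N+K)
    return scores.softmax(dim=-1) @ context                      # (N, D)

# Toy usage: 196 patch tokens (a 14x14 grid), 3 exemplars, width 384.
patches = torch.randn(196, 384)
exemplars = torch.randn(3, 384)
out = similarity_attention(patches, exemplars)
assert out.shape == (196, 384)
```

Because the exemplar tokens simply extend the key/value set, dropping them (K = 0) recovers the zero-shot setting without changing the computation.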
Related papers
- Causal Image Modeling for Efficient Visual Understanding [41.87857129429512]
We introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations.
This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length.
In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework.
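As a rough, hypothetical sketch of this paradigm (not the Adventurer implementation), an image can be flattened into a raster-ordered patch sequence and processed with a causally masked sequence model, so each patch representation depends only on earlier patches; the paper's recurrent, linear-complexity formulation would replace the quadratic attention used here:

```python
import torch
import torch.nn as nn

class CausalPatchModel(nn.Module):
    """Toy uni-directional model over raster-ordered patch tokens.

    A boolean causal mask restricts each patch to attend only to
    preceding patches, mirroring a language model over patch tokens.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, L, dim)
        L = x.shape[1]
        # True entries are disallowed: mask out all future positions.
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

tokens = torch.randn(2, 196, 256)               # 14x14 patches, batch of 2
print(CausalPatchModel()(tokens).shape)         # torch.Size([2, 196, 256])
```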
arXiv Detail & Related papers (2024-10-10T04:14:52Z)
- CountGD: Multi-Modal Open-World Counting [54.88804890463491]
This paper aims to improve the generality and accuracy of open-vocabulary object counting in images.
We introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description or visual exemplars or both.
arXiv Detail & Related papers (2024-07-05T16:20:48Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when training with cross-entropy only and when fine-tuning with self-critical sequence training.
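A toy sketch of the prototype-memory idea, under the assumption that the prototypes are precomputed summaries (e.g., running centroids) of keys and values seen on other training samples; all names and shapes here are illustrative, not the paper's API:

```python
import torch

def memory_attention(q, k, v, proto_k, proto_v):
    """Attend over the current keys/values plus a prototype memory.

    Hypothetical sketch: proto_k / proto_v summarize activations from
    other training samples and are concatenated with the current
    sample's keys and values before standard attention.
    q: (N, D); k, v: (M, D); proto_k, proto_v: (P, D).
    """
    keys = torch.cat([k, proto_k], dim=0)        # (M+P, D)
    vals = torch.cat([v, proto_v], dim=0)
    scores = q @ keys.T / q.shape[-1] ** 0.5     # (N, M+P)
    return scores.softmax(dim=-1) @ vals         # (N, D)
```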
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- Counting Like Human: Anthropoid Crowd Counting on Modeling the Similarity of Objects [92.80955339180119]
Mainstream crowd counting methods regress a density map and integrate it to obtain the count.
Inspired by the way humans count by exploiting the similarity between objects, we propose a rational and anthropoid crowd counting framework.
arXiv Detail & Related papers (2022-12-02T07:00:53Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
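A minimal sketch of this content-based sparsification, assuming plain k-means over the keys with mean-pooled values per cluster (an illustration of the idea, not ClusTR's exact procedure):

```python
import torch

def clustered_attention(q, k, v, num_clusters=16, iters=5):
    """Sparse attention via key/value clustering (illustrative sketch).

    Keys are grouped with a few k-means steps; each cluster's keys and
    values are mean-pooled, so queries attend to num_clusters tokens
    instead of all M, cutting cost from O(N*M) to O(N*num_clusters).
    Assumes M >= num_clusters.
    """
    M, D = k.shape
    centroids = k[torch.randperm(M)[:num_clusters]]        # init from keys
    for _ in range(iters):                                 # plain k-means
        assign = torch.cdist(k, centroids).argmin(dim=1)   # (M,)
        for c in range(num_clusters):
            members = assign == c
            if members.any():
                centroids[c] = k[members].mean(dim=0)
    # Pool the values per cluster to match the pooled keys.
    pooled_v = torch.stack([
        v[assign == c].mean(dim=0) if (assign == c).any() else torch.zeros(D)
        for c in range(num_clusters)
    ])
    scores = q @ centroids.T / D ** 0.5                    # (N, C)
    return scores.softmax(dim=-1) @ pooled_v               # (N, D)

q, k, v = (torch.randn(196, 64) for _ in range(3))
assert clustered_attention(q, k, v).shape == (196, 64)
```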
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Shelf-Supervised Mesh Prediction in the Wild [54.01373263260449]
We propose a learning-based approach to infer the 3D shape and pose of an object from a single image.
We first infer a volumetric representation in a canonical frame, along with the camera pose.
The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame.
arXiv Detail & Related papers (2021-02-11T18:57:10Z)
- Sequential View Synthesis with Transformer [13.200139959163574]
We introduce a sequential rendering decoder to predict an image sequence, including the target view, based on the learned representations.
We evaluate our model on various challenging datasets and demonstrate that it not only gives consistent predictions but also does not require any retraining for fine-tuning.
arXiv Detail & Related papers (2020-04-09T14:15:27Z)