Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting
- URL: http://arxiv.org/abs/2602.12774v1
- Date: Fri, 13 Feb 2026 09:58:35 GMT
- Title: Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting
- Authors: Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Qijun Chen, Miaojing Shi
- Abstract summary: We propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs.
- Score: 59.37613121962146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations for every object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at https://github.com/viscom-tongji/WS-COC.
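The divide-and-discern strategy described in the abstract amounts to progressively halving a candidate count range through multi-round yes/no dialogue. The sketch below illustrates this narrowing logic only; `ask_mllm` is a hypothetical stand-in for the real multimodal model call, and the range bounds are illustrative assumptions, not the paper's actual configuration.

```python
def ask_mllm(image, lo, hi, true_count):
    """Placeholder oracle. A real system would prompt the MLLM with the
    image and a question like 'Is the number of objects within [lo, hi]?'
    Here we simulate a perfect answer from a known ground-truth count."""
    return lo <= true_count <= hi


def divide_and_discern(image, lo=0, hi=1024, true_count=None, oracle=ask_mllm):
    """Narrow the count range [lo, hi] round by round until one value remains.

    Each dialogue round asks whether the count falls in the lower half of the
    current range, then discards the half the oracle rules out, so a range of
    size N is resolved in about log2(N) rounds.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(image, lo, mid, true_count):
            hi = mid        # count lies in the lower half
        else:
            lo = mid + 1    # count lies in the upper half
    return lo


# With a perfect oracle, a count of 37 in [0, 1024] is pinned down
# in roughly ten dialogue rounds.
print(divide_and_discern(None, true_count=37))
```

In the actual framework the oracle's answers come from a tuned MLLM and are noisy, so this clean binary narrowing is only the idealized skeleton of the dialogue tuning procedure.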
Related papers
- SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection [19.35154888756369]
A consistent trend throughout the research of oriented object detection has been the pursuit of maintaining comparable performance with fewer and weaker annotations. This is particularly crucial in the remote sensing domain, where the dense object distribution and a wide variety of categories contribute to prohibitively high annotation costs. We introduce the first Sparse Partial Weakly-Supervised Oriented Object Detection framework, designed to efficiently leverage only a few weakly-labeled data and plenty of unlabeled data.
arXiv Detail & Related papers (2026-02-03T15:21:01Z) - LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering [52.41664454251679]
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering. Existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task.
arXiv Detail & Related papers (2025-11-19T13:22:08Z) - LLM-guided Hierarchical Retrieval [54.73080745446999]
LATTICE is a hierarchical retrieval framework that enables an LLM to reason over and navigate large corpora with logarithmic search complexity. A central challenge in such LLM-guided search is that the model's relevance judgments are noisy, context-dependent, and unaware of the hierarchy. Our framework achieves state-of-the-art zero-shot performance on the reasoning-intensive BRIGHT benchmark.
arXiv Detail & Related papers (2025-10-15T07:05:17Z) - SkillAggregation: Reference-free LLM-Dependent Aggregation [14.46141987797362]
Large Language Models (LLMs) are increasingly used to assess NLP tasks.
Recent work suggests using multiple LLMs as judges yields improved performance.
This work focuses on aggregating predictions from multiple systems where no reference labels are available.
arXiv Detail & Related papers (2024-10-14T07:13:47Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Point, Segment and Count: A Generalized Framework for Object Counting [40.192374437785155]
Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names.
We propose a generalized framework for both few-shot and zero-shot object counting based on detection.
PseCo achieves state-of-the-art performance in both few-shot/zero-shot object counting/detection.
arXiv Detail & Related papers (2023-11-21T06:55:21Z) - SQLNet: Scale-Modulated Query and Localization Network for Few-Shot Class-Agnostic Counting [67.97870844244187]
The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image. We propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (SQLNet). It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size.
arXiv Detail & Related papers (2023-11-16T16:50:56Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - Learning from Counting: Leveraging Temporal Classification for Weakly Supervised Object Localization and Detection [4.971083368517706]
We introduce scan-order techniques to serialize 2D images into 1D sequence data.
We then leverage a combined LSTM (Long Short-Term Memory) and CTC network to achieve object localization.
arXiv Detail & Related papers (2021-03-06T02:18:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.