Active View Selection for Scene-level Multi-view Crowd Counting and Localization with Limited Labels
- URL: http://arxiv.org/abs/2509.16684v1
- Date: Sat, 20 Sep 2025 13:23:46 GMT
- Title: Active View Selection for Scene-level Multi-view Crowd Counting and Localization with Limited Labels
- Authors: Qi Zhang, Bin Li, Antoni B. Chan, Hui Huang
- Abstract summary: Multi-view crowd counting and localization fuse multiple input views to estimate the crowd number or locations on the ground. Existing methods require massive labeled views and images, and lack the ability for cross-scene settings. We propose an independent view selection method (IVS) that considers view and scene geometries in the view selection strategy. We also put forward an active view selection method (AVS) that jointly optimizes the view selection, labeling, and downstream tasks.
- Score: 55.396639405563526
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-view crowd counting and localization fuse multiple input views to estimate the crowd number or locations on the ground. Existing methods mainly focus on accurately predicting the crowd shown in the input views, and neglect the problem of choosing the 'best' camera views to perceive all crowds in the scene well. Besides, existing view selection methods require massive labeled views and images, and lack the ability for cross-scene settings, which limits their application scenarios. Thus, in this paper, we study the view selection issue for better scene-level multi-view crowd counting and localization results with cross-scene ability and limited label demand, instead of input-view-level results. We first propose an independent view selection method (IVS) that considers view and scene geometries in the view selection strategy and conducts the view selection, labeling, and downstream tasks independently. Based on IVS, we also put forward an active view selection method (AVS) that jointly optimizes the view selection, labeling, and downstream tasks. In AVS, we actively select the labeled views and consider both the view/scene geometries and the predictions of the downstream task models in the view selection process. Experiments on multi-view counting and localization tasks demonstrate the cross-scene and limited-label-demand advantages of the proposed active view selection method (AVS), which outperforms existing methods and supports wider application scenarios.
Related papers
- Visual Test-time Scaling for GUI Agent Grounding [61.609126885427386]
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. We observe significant performance gains of 28+% on Screenspot-pro and 24+% on WebVoyager benchmarks.
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos [71.01549400773197]
We introduce SWITCH-A-VIEW, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint. We then discover the patterns between the visual and spoken content in a how-to video on the one hand and its view-switch moments on the other hand.
arXiv Detail & Related papers (2024-12-24T12:16:43Z)
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos [66.1935609072708]
LangView is a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. During inference, our model takes as input only a multi-view video (no language or camera poses) and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z)
- Improving Viewpoint-Independent Object-Centric Representations through Active Viewpoint Selection [41.419853273742746]
We propose a novel active viewpoint selection strategy for object-centric learning.
It predicts images from unknown viewpoints based on information from observation images for each scene.
Our method can accurately predict images from unknown viewpoints.
arXiv Detail & Related papers (2024-11-01T07:01:44Z)
- Action Selection Learning for Multi-label Multi-view Action Recognition [2.8266810371534152]
This study focuses on real-world scenarios where cameras are distributed to capture a wide area, with only weak labels available at the video level.
We propose the method named Multi-view Action Selection Learning (MultiASL), which leverages action selection learning to enhance view fusion.
Experiments in a real-world office environment using the MM-Office dataset demonstrate the superior performance of the proposed method compared to existing methods.
arXiv Detail & Related papers (2024-10-04T10:36:22Z)
- Learning to Select Camera Views: Efficient Multiview Understanding at Few Glances [59.34619548026885]
We propose a view selection approach that analyzes the target object or scenario from given views and selects the next best view for processing.
Our approach features a reinforcement learning based camera selection module, MVSelect, that not only selects views but also facilitates joint training with the task network.
arXiv Detail & Related papers (2023-03-10T18:59:10Z)
- Cross-View Cross-Scene Multi-View Crowd Counting [56.83882084112913]
Multi-view crowd counting has been previously proposed to utilize multiple cameras to extend the field of view of a single camera.
We propose a cross-view cross-scene (CVCS) multi-view crowd counting paradigm, where the training and testing occur on different scenes with arbitrary camera layouts.
arXiv Detail & Related papers (2022-05-03T15:03:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.