Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors
- URL: http://arxiv.org/abs/2601.01192v1
- Date: Sat, 03 Jan 2026 14:24:12 GMT
- Title: Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors
- Authors: Hao Lu, Xuhui Zhu, Wenjing Zhang, Yanan Li, Xiang Bai
- Abstract summary: Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. We build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded pedestrian flows. OMAN++ outperforms state-of-the-art VIC baselines on the SenseCrowd, CroHD, and MovingDroneCrowd benchmarks.
- Score: 48.01681141887943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC, which learns to count pedestrians frame by frame, VIC must identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior, which indicates that pedestrians tend to gather in groups, and ii) the spatial-temporal displacement prior, which informs that an individual cannot physically teleport. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and an O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form a novel and strong VIC baseline, OMAN++. Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also shows a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at https://github.com/tiny-smart/OMAN.
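As an illustration of the two priors described in the abstract (a hedged sketch, not the paper's actual OMAN++ implementation), a minimal one-to-many matcher can first mask physically implausible displacements and then spread soft matching mass over the remaining candidates; `o2m_match`, `max_disp`, and `tau` are hypothetical names chosen for this example:

```python
import numpy as np

def o2m_match(feat_a, feat_b, pos_a, pos_b, max_disp=50.0, tau=0.1):
    """Toy one-to-many (O2M) matcher. feat_*: (N, D) appearance features;
    pos_*: (N, 2) image positions. Matches farther than max_disp pixels
    are suppressed, encoding the "no teleporting" displacement prior."""
    # Cosine similarity between L2-normalized appearance features.
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T
    # Displacement prior: assign a large negative score to implausible moves.
    dist = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    sim = np.where(dist <= max_disp, sim, -1e9)
    # Row-wise softmax: each row may place mass on several candidates,
    # i.e., one-to-many rather than a hard one-to-one assignment.
    logits = sim / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

A lower `tau` sharpens the assignment toward one-to-one; raising it lets social groups share matching mass, which is the relaxation the abstract motivates.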
Related papers
- Video Individual Counting With Implicit One-to-Many Matching [8.80200994828351]
Video Individual Counting (VIC) aims to estimate pedestrian flux from a video. The key problem of VIC is how to identify co-existent pedestrians between frames. We introduce OMAN, a simple but effective VIC model with implicit One-to-Many mAtchiNg.
arXiv Detail & Related papers (2025-06-16T03:20:00Z)
- Video Individual Counting for Moving Drones [51.429771128144964]
Video Individual Counting (VIC) has received increasing attention for its importance in intelligent video surveillance. Previous datasets are captured with fixed or rarely moving cameras and relatively sparse individuals, restricting evaluation under highly varying views and time spans in crowded scenes. To address these issues, we introduce the MovingDroneCrowd dataset, featuring videos captured by fast-moving drones in crowded scenes under diverse illuminations, shooting heights, and angles.
arXiv Detail & Related papers (2025-03-12T07:09:33Z)
- Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes [70.08318779492944]
We are the first to harness vanishing point (VP) priors for more effective segmentation.
Our novel, efficient network for VSS, named VPSeg, incorporates two modules that utilize exactly this pair of static and dynamic VP priors.
arXiv Detail & Related papers (2024-01-27T01:01:58Z)
- STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes [78.95447086305381]
Accurately detecting and tracking pedestrians in 3D space is challenging due to large variations in rotations, poses and scales.
Existing benchmarks either only provide 2D annotations, or have limited 3D annotations with low-density pedestrian distribution.
We introduce a large-scale multimodal dataset, STCrowd, to better evaluate pedestrian perception algorithms in crowded scenarios.
arXiv Detail & Related papers (2022-04-03T08:26:07Z)
- DR.VIC: Decomposition and Reasoning for Video Individual Counting [93.12166351940242]
We propose to conduct pedestrian counting from a new perspective - Video Individual Counting (VIC).
Instead of relying on the Multiple Object Tracking (MOT) techniques, we propose to solve the problem by decomposing all pedestrians into the initial pedestrians who existed in the first frame and the new pedestrians with separate identities in each following frame.
An end-to-end Decomposition and Reasoning Network (DRNet) is designed to predict the initial pedestrian count with a density estimation method and to reason about the count of new pedestrians in each frame with differentiable optimal transport.
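Differentiable optimal transport of the kind mentioned above is commonly realized with entropic regularization and Sinkhorn iterations. The following is an illustrative minimal sketch of that general technique, not DRNet's actual code:

```python
import numpy as np

def sinkhorn(cost, r, c, eps=0.05, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.
    cost: (N, M) pairwise cost matrix; r (N,) and c (M,): source/target
    marginals summing to 1. Returns a soft transport plan whose row and
    column sums match r and c."""
    K = np.exp(-cost / eps)            # Gibbs kernel of the cost matrix
    u = np.ones_like(r)
    for _ in range(n_iters):
        v = c / (K.T @ u)              # rescale to match column marginals
        u = r / (K @ v)                # rescale to match row marginals
    return u[:, None] * K * v[None, :] # plan = diag(u) @ K @ diag(v)
```

Because every step is a smooth operation, gradients flow through the resulting plan, which is what allows matching-based counting models to be trained end-to-end.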
arXiv Detail & Related papers (2022-03-23T11:24:44Z)
- Pedestrian Stop and Go Forecasting with Hybrid Feature Fusion [87.77727495366702]
We introduce the new task of pedestrian stop and go forecasting.
Considering the lack of suitable existing datasets for it, we release TRANS, a benchmark for explicitly studying the stop and go behaviors of pedestrians in urban traffic.
We build it from several existing datasets annotated with pedestrians' walking motions, in order to have various scenarios and behaviors.
arXiv Detail & Related papers (2022-03-04T18:39:31Z)
- Safety-Oriented Pedestrian Motion and Scene Occupancy Forecasting [91.69900691029908]
We advocate for predicting both the individual motions as well as the scene occupancy map.
We propose a Scene-Actor Graph Neural Network (SA-GNN) which preserves the relative spatial information of pedestrians.
On two large-scale real-world datasets, we showcase that our scene-occupancy predictions are more accurate and better calibrated than those from state-of-the-art motion forecasting methods.
arXiv Detail & Related papers (2021-01-07T06:08:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.