Video Individual Counting With Implicit One-to-Many Matching
- URL: http://arxiv.org/abs/2506.13067v1
- Date: Mon, 16 Jun 2025 03:20:00 GMT
- Title: Video Individual Counting With Implicit One-to-Many Matching
- Authors: Xuhui Zhu, Jing Xu, Bingjie Wang, Huikang Dai, Hao Lu
- Abstract summary: Video Individual Counting (VIC) aims to estimate pedestrian flux from a video. The key problem of VIC is how to identify co-existent pedestrians between frames. We introduce OMAN, a simple but effective VIC model with implicit One-to-Many mAtchiNg.
- Score: 8.80200994828351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Individual Counting (VIC) is a recently introduced task that aims to estimate pedestrian flux from a video. It extends conventional Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC, which only learns to count repeated pedestrian patterns across frames, the key problem of VIC is how to identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, mainly follow a one-to-one (O2O) matching strategy where the same pedestrian must be exactly matched between frames, leading to sensitivity to appearance variations or missing detections. In this work, we show that the O2O matching can be relaxed to a one-to-many (O2M) matching problem, which better fits the nature of VIC and can leverage the social grouping behavior of walking pedestrians. We therefore introduce OMAN, a simple but effective VIC model with implicit One-to-Many mAtchiNg, featuring an implicit context generator and a one-to-many pairwise matcher. Experiments on the SenseCrowd and CroHD benchmarks show that OMAN achieves state-of-the-art performance. Code is available at https://github.com/tiny-smart/OMAN.
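The abstract does not spell out OMAN's matcher, but the one-to-many idea it describes can be sketched in a toy form: instead of a hard one-to-one assignment, each current-frame pedestrian is softly matched against all previous-frame candidates plus a virtual "inflow" slot, and the expected inflow count is the probability mass landing on that slot. The function name, temperature, and inflow token below are illustrative assumptions, not OMAN's actual architecture.

```python
import numpy as np

def o2m_inflow_count(feat_prev, feat_cur, tau=0.1):
    """Toy one-to-many matcher.

    feat_prev: (N_prev, D) L2-normalised features of previous-frame pedestrians.
    feat_cur:  (N_cur, D)  L2-normalised features of current-frame pedestrians.
    Returns the expected number of newly appeared (inflow) pedestrians and
    the soft assignment matrix.
    """
    # cosine similarity between every current/previous pedestrian pair
    sim = feat_cur @ feat_prev.T                    # (N_cur, N_prev)
    # append a zero-logit virtual column representing "no previous match"
    inflow_logit = np.zeros((sim.shape[0], 1))
    logits = np.concatenate([sim, inflow_logit], axis=1) / tau
    # row-wise softmax: a soft one-to-many assignment per current pedestrian
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # expected inflow = total mass assigned to the virtual column
    return p[:, -1].sum(), p
```

A pedestrian whose appearance closely matches someone in the previous frame contributes almost nothing to the inflow count, while a genuinely new pedestrian contributes close to one; because every step is differentiable, such a matcher can be trained end-to-end.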
Related papers
- Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes [3.2416801263793285]
We propose Self-MVA, a self-supervised uncalibrated multi-view person association approach that uses no annotations. Specifically, we propose a self-supervised learning framework consisting of an encoder-decoder model and a self-supervised pretext task. Our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches.
arXiv Detail & Related papers (2025-03-17T21:48:56Z)
- Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network [57.72095897427665]
Temporal sentence grounding (TSG) aims to locate query-relevant segments in videos. Previous methods follow a single-thread framework that cannot co-train different pairs. We propose Multi-Pair TSG, which aims to co-train these pairs.
arXiv Detail & Related papers (2024-12-20T08:50:11Z)
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Weakly Supervised Video Individual Counting [126.75545291243142]
Video Individual Counting aims to predict the number of unique individuals in a single video.
We introduce a weakly supervised VIC task, wherein trajectory labels are not provided.
In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining pedestrians.
arXiv Detail & Related papers (2023-12-10T16:12:13Z)
- Robust Multi-Object Tracking by Marginal Inference [92.48078680697311]
Multi-object tracking in videos requires solving a fundamental problem of one-to-one assignment between objects in adjacent frames.
We present an efficient approach to compute a marginal probability for each pair of objects in real time.
It achieves competitive results on MOT17 and MOT20 benchmarks.
arXiv Detail & Related papers (2022-08-07T14:04:45Z)
- CGUA: Context-Guided and Unpaired-Assisted Weakly Supervised Person Search [54.106662998673514]
We introduce a Context-Guided and Unpaired-Assisted (CGUA) weakly supervised person search framework.
Specifically, we propose a novel Context-Guided Cluster (CGC) algorithm to leverage context information in the clustering process.
Our method achieves comparable or better performance to the state-of-the-art supervised methods by leveraging more diverse unlabeled data.
arXiv Detail & Related papers (2022-03-27T13:57:30Z)
- DR.VIC: Decomposition and Reasoning for Video Individual Counting [93.12166351940242]
We propose to conduct pedestrian counting from a new perspective: Video Individual Counting (VIC).
Instead of relying on the Multiple Object Tracking (MOT) techniques, we propose to solve the problem by decomposing all pedestrians into the initial pedestrians who existed in the first frame and the new pedestrians with separate identities in each following frame.
An end-to-end Decomposition and Reasoning Network (DRNet) is designed to predict the initial pedestrian count with the density estimation method and reason about the new pedestrians' count in each frame with differentiable optimal transport.
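DRNet's exact optimal transport formulation is not given in this summary, but the differentiable ingredient it names is typically entropy-regularised OT solved by Sinkhorn iterations; a generic sketch follows, with all names, marginals, and parameters as illustrative assumptions rather than DRNet's actual code.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropy-regularised optimal transport via Sinkhorn iterations.

    cost: (n, m) pairwise cost matrix (e.g. distances between pedestrians
    in consecutive frames). Returns a soft transport plan P whose row/column
    sums match uniform marginals; every step is differentiable, which is
    what allows an OT-based counter to be trained end-to-end.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)     # source marginal (previous-frame individuals)
    b = np.full(m, 1.0 / m)     # target marginal (current-frame individuals)
    K = np.exp(-cost / eps)     # Gibbs kernel
    u = np.ones(n)
    for _ in range(iters):
        # alternating scaling updates: u = a / (K v), v = b / (K^T u)
        u = a / (K @ (b / (K.T @ u)))
    v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)   # transport plan P
```

Mass in the plan that cannot be transported cheaply (high cost to every previous-frame individual) is the kind of signal such methods read off as a newly appeared pedestrian.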
arXiv Detail & Related papers (2022-03-23T11:24:44Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
- Few-Shot Action Recognition with Compromised Metric via Optimal Transport [31.834843714684343]
Few-shot action recognition is still not mature despite the extensive research on few-shot image classification.
One main obstacle to applying these algorithms in action recognition is the complex structure of videos.
We propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions.
arXiv Detail & Related papers (2021-04-08T12:42:05Z)
- CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions [61.724894233252414]
This paper proposes a self-supervised learning method for the person re-identification (re-ID) problem.
Existing unsupervised methods usually rely on pseudo labels, such as those from video tracklets or clustering.
We introduce a different unsupervised method that allows us to learn pedestrian embeddings from raw videos, without resorting to pseudo labels.
arXiv Detail & Related papers (2020-07-15T09:52:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.