Video Crowd Localization with Multi-focus Gaussian Neighbor Attention
and a Large-Scale Benchmark
- URL: http://arxiv.org/abs/2107.08645v2
- Date: Tue, 20 Jul 2021 01:46:08 GMT
- Title: Video Crowd Localization with Multi-focus Gaussian Neighbor Attention
and a Large-Scale Benchmark
- Authors: Haopeng Li, Lingbo Liu, Kunlin Yang, Shinan Liu, Junyu Gao, Bin Zhao,
Rui Zhang, Jun Hou
- Abstract summary: We develop a unified neural network called GNANet to accurately locate head centers in video clips.
To facilitate future research in this field, we introduce a large-scale crowded video benchmark named SenseCrowd.
The proposed method is capable of achieving state-of-the-art performance for both video crowd localization and counting.
- Score: 35.607604087583425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video crowd localization is a crucial yet challenging task, which aims to
estimate exact locations of human heads in the given crowded videos. To model
spatial-temporal dependencies of human mobility, we propose a multi-focus
Gaussian neighbor attention (GNA), which can effectively exploit long-range
correspondences while maintaining the spatial topological structure of the
input videos. In particular, our GNA also captures the scale variation of
human heads well through its multi-focus mechanism. Based on the
multi-focus GNA, we develop a unified neural network called GNANet to
accurately locate head centers in video clips by fully aggregating
spatial-temporal information via a scene modeling module and a context
cross-attention module. Moreover, to facilitate future research in this
field, we introduce a large-scale crowded video benchmark named SenseCrowd,
which consists of 60K+ frames captured in various surveillance scenarios and
2M+ head annotations. Finally, we conduct extensive experiments on three
datasets including our SenseCrowd, and the experimental results show that the
proposed method is capable of achieving state-of-the-art performance for both
video crowd localization and counting. The code and the dataset will be
released.
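The abstract does not spell out the attention formulation, so the following PyTorch snippet is only one plausible reading of a multi-focus Gaussian neighbor attention layer: attention logits over spatial positions are modulated by a Gaussian prior on pairwise pixel distances, and each "focus" corresponds to one Gaussian bandwidth. The module name, the `sigmas` values, and the averaging fusion are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of one possible multi-focus Gaussian neighbor attention layer.
# The Gaussian distance prior, the set of bandwidths `sigmas`, and the simple
# averaging over foci are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiFocusGaussianNeighborAttention(nn.Module):
    def __init__(self, channels, sigmas=(1.0, 2.0, 4.0)):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmas = sigmas          # one spatial bandwidth per "focus"
        self.scale = channels ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c)
        k = self.key(x).flatten(2)                     # (b, c, hw)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, hw, c)

        # Pairwise squared distances between spatial positions.
        ys, xs = torch.meshgrid(
            torch.arange(h, device=x.device, dtype=x.dtype),
            torch.arange(w, device=x.device, dtype=x.dtype),
            indexing="ij",
        )
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (hw, 2)
        dist2 = torch.cdist(pos, pos).pow(2)                     # (hw, hw)

        logits = torch.bmm(q, k) * self.scale                    # (b, hw, hw)
        out = 0
        for sigma in self.sigmas:
            # Gaussian neighbor prior: nearby positions get larger weights,
            # so attention stays spatially local at this focus.
            prior = -dist2 / (2.0 * sigma ** 2)                  # (hw, hw)
            attn = F.softmax(logits + prior.unsqueeze(0), dim=-1)
            out = out + torch.bmm(attn, v)                       # (b, hw, c)
        out = out / len(self.sigmas)
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Under this reading, a small bandwidth keeps attention local (suited to small, densely packed heads) while a larger bandwidth widens the neighborhood for larger heads, which is one way a multi-focus mechanism could absorb head-scale variation. GNANet as described additionally aggregates temporal context across frames via its scene modeling and context cross-attention modules, which this sketch omits.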
Related papers
- ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers [9.271932084757646]
3D occupancy represents the entire scene without distinguishing between foreground and background by discretizing the physical space into a grid map.
We propose our learning-first view attention mechanism for effective multi-view feature aggregation.
We present FlowOcc3D, a benchmark built on top of existing high-quality datasets.
arXiv Detail & Related papers (2024-05-07T13:15:07Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action
Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - Enhancing Egocentric 3D Pose Estimation with Third Person Views [37.9683439632693]
We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera.
We introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives.
Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos.
arXiv Detail & Related papers (2022-01-06T11:42:01Z) - HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Self-supervised Human Detection and Segmentation via Multi-view
Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z) - Self-supervised Video Representation Learning by Uncovering
Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs (a minimal sketch of this setup follows the related-papers list).
arXiv Detail & Related papers (2020-08-31T08:31:56Z) - Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)