Motion-guided Non-local Spatial-Temporal Network for Video Crowd
Counting
- URL: http://arxiv.org/abs/2104.13946v1
- Date: Wed, 28 Apr 2021 18:05:13 GMT
- Title: Motion-guided Non-local Spatial-Temporal Network for Video Crowd
Counting
- Authors: Haoyue Bai, S.-H. Gary Chan
- Abstract summary: We study video crowd counting, which is to estimate the number of objects in all the frames of a video sequence.
We propose Monet, a motion-guided non-local spatial-temporal network for video crowd counting.
Our approach achieves substantially better performance in terms of MAE and MSE as compared with other state-of-the-art approaches.
- Score: 2.3732259124656903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study video crowd counting, which is to estimate the number of objects
(people in this paper) in all the frames of a video sequence. Previous work on
crowd counting is mostly on still images. There has been little work on how to
properly extract and take advantage of the spatial-temporal correlation between
neighboring frames in both short and long ranges to achieve high estimation
accuracy for a video sequence. In this work, we propose Monet, a novel and
highly accurate motion-guided non-local spatial-temporal network for video
crowd counting. Monet first takes people flow (motion information) as guidance
to coarsely segment the regions of pixels where a person may be. Given these
regions, Monet then uses a non-local spatial-temporal network to extract both
short- and long-range spatial-temporal contextual information. The whole
network is finally trained end-to-end with a fused loss to generate a
high-quality density map. Noting the scarcity and low quality (in terms of
resolution and scene diversity) of the publicly available video crowd datasets,
we have collected and built a large-scale video crowd counting dataset,
VidCrowd, to contribute to the community. VidCrowd contains 9,000 frames of
high resolution (2560 x 1440), with 1,150,239 head annotations captured across
different scenes, crowd densities and lighting conditions in two cities. We have
conducted extensive experiments on the challenging VidCrowd and two public video crowd
counting datasets: UCSD and Mall. Our approach achieves substantially better
performance in terms of MAE and MSE as compared with other state-of-the-art
approaches.
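
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of its three stages: a motion-guided module that coarsely masks likely-person regions from people-flow input, a non-local spatial-temporal (self-attention) block that aggregates short- and long-range context across all frames of a clip, and a density-map head trained with a fused loss. The module names, channel widths, the gating of features by the mask, and the exact composition of the fused loss are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionGuidedMask(nn.Module):
    """Coarse soft mask of likely-person regions, predicted from people-flow (motion) maps."""
    def __init__(self, flow_ch=2, hid=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(flow_ch, hid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hid, 1, 1), nn.Sigmoid(),
        )

    def forward(self, flow):                 # flow: (B*T, 2, H, W)
        return self.net(flow)                # mask: (B*T, 1, H, W) in [0, 1]


class NonLocalSpatialTemporal(nn.Module):
    """Non-local (self-attention) block over every pixel of every frame in a clip,
    so each location can aggregate both short- and long-range context."""
    def __init__(self, ch):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 2, 1)
        self.phi = nn.Conv2d(ch, ch // 2, 1)
        self.g = nn.Conv2d(ch, ch // 2, 1)
        self.out = nn.Conv2d(ch // 2, ch, 1)

    def forward(self, feats):                # feats: (B, T, C, H, W)
        B, T, C, H, W = feats.shape
        x = feats.reshape(B * T, C, H, W)

        def flatten(t):                      # (B*T, C', H, W) -> (B, T*H*W, C')
            return t.reshape(B, T, -1, H, W).permute(0, 1, 3, 4, 2).reshape(B, T * H * W, -1)

        q, k, v = flatten(self.theta(x)), flatten(self.phi(x)), flatten(self.g(x))
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        y = (attn @ v).reshape(B, T, H, W, -1).permute(0, 1, 4, 2, 3)  # back to (B, T, C', H, W)
        y = self.out(y.reshape(B * T, -1, H, W)).reshape(B, T, C, H, W)
        return feats + y                     # residual connection


class MonetSketch(nn.Module):
    """Motion-guided masking -> non-local spatial-temporal attention -> density map."""
    def __init__(self, ch=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.mask = MotionGuidedMask()
        self.nl_st = NonLocalSpatialTemporal(ch)
        self.head = nn.Conv2d(ch, 1, 1)

    def forward(self, frames, flows):        # frames: (B, T, 3, H, W), flows: (B, T, 2, H, W)
        B, T, _, H, W = frames.shape
        feats = self.backbone(frames.flatten(0, 1))        # per-frame appearance features
        feats = feats * self.mask(flows.flatten(0, 1))     # gate features by motion-guided mask
        feats = self.nl_st(feats.reshape(B, T, -1, H, W))  # short- and long-range context
        return self.head(feats.flatten(0, 1)).reshape(B, T, 1, H, W)


def fused_loss(pred, gt, alpha=0.01):
    """Illustrative 'fused' objective: pixel-wise density MSE plus a per-frame count term."""
    density = F.mse_loss(pred, gt)
    count = F.l1_loss(pred.sum(dim=(-1, -2)), gt.sum(dim=(-1, -2)))
    return density + alpha * count
```

At evaluation time, per-frame counts are obtained by summing the predicted density map, and MAE/MSE are then computed between the predicted and ground-truth counts.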
Related papers
- Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization [85.85582751254785]
We present a novel approach to NLVL that aims to address this issue.
Our method involves the direct generation of a global 2D temporal map via a conditional denoising diffusion process.
Our approach effectively encapsulates the interaction between the query and video data across various time scales.
arXiv Detail & Related papers (2024-01-16T09:33:29Z)
- Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames [24.614476456145255]
We propose summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
arXiv Detail & Related papers (2023-11-28T22:18:26Z)
- NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos [51.409547544747284]
NPF-200 is the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations.
We conduct a series of analyses to gain deeper insights into this task.
We propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet.
arXiv Detail & Related papers (2023-08-23T14:25:22Z)
- PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking [90.29143475328506]
We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework.
Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion.
We animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos.
arXiv Detail & Related papers (2023-07-27T17:58:11Z)
- Video Crowd Localization with Multi-focus Gaussian Neighbor Attention and a Large-Scale Benchmark [35.607604087583425]
We develop a unified neural network called GNANet to accurately locate head centers in video clips.
To facilitate future researches in this field, we introduce a large-scale crowded video benchmark named SenseCrowd.
The proposed method achieves state-of-the-art performance on both video crowd localization and counting.
arXiv Detail & Related papers (2021-07-19T06:59:27Z)
- Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in Large Scenes [50.744452135300115]
We propose a deep neural network framework for multi-view crowd counting.
Our methods achieve state-of-the-art results compared to other multi-view counting baselines.
arXiv Detail & Related papers (2020-12-02T03:20:30Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving the action recognition by fully-utilizing the information of scenes and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in every frame.
We then apply action recognition models to learn the temporal information from video frames, on both the HIE dataset and new data with diverse scenes from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
- Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
- DVI: Depth Guided Video Inpainting for Autonomous Driving [35.94330601020169]
We present an automatic video inpainting algorithm that can remove traffic agents from videos.
By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated.
We are the first to fuse multiple videos for video inpainting.
arXiv Detail & Related papers (2020-07-17T09:29:53Z)