A Flow Base Bi-path Network for Cross-scene Video Crowd Understanding in
Aerial View
- URL: http://arxiv.org/abs/2009.13723v1
- Date: Tue, 29 Sep 2020 01:48:24 GMT
- Title: A Flow Base Bi-path Network for Cross-scene Video Crowd Understanding in
Aerial View
- Authors: Zhiyuan Zhao, Tao Han, Junyu Gao, Qi Wang, Xuelong Li
- Abstract summary: In this paper, we strive to tackle the challenges and automatically understand the crowd from the visual data collected from drones.
To alleviate the background noise generated in cross-scene testing, a double-stream crowd counting model is proposed.
To tackle the crowd density estimation problem under extremely dark environments, we introduce synthetic data generated by the game Grand Theft Auto V (GTAV).
- Score: 93.23947591795897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Drone footage can be applied to dynamic traffic monitoring, object
detection and tracking, and other vision tasks. The variability of the shooting
location introduces intractable challenges to these missions, such as varying
scale, unstable exposure, and scene migration. In this paper, we strive to
tackle the above challenges and automatically understand the crowd from the
visual data collected from drones. First, to alleviate the background noise
introduced by cross-scene testing, we propose a double-stream crowd counting
model that extracts optical flow and frame difference information as an
additional branch. Second, to improve the model's generalization across
different scales and times of day, we randomly combine a variety of data
transformation methods to simulate unseen environments. Finally, to tackle
crowd density estimation in extremely dark environments, we introduce synthetic
data generated with the game Grand Theft Auto V (GTAV). Experimental results
show the effectiveness of the virtual data. Our method wins the challenge with
a mean absolute error (MAE) of 12.70. Moreover, a comprehensive ablation study
is conducted to explore each component's contribution.
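
The double-stream design relies on motion cues computed from consecutive frames. Below is a minimal, hypothetical sketch (not the authors' released code) of how the two cues named in the abstract, dense optical flow and frame difference, could be extracted with OpenCV before being fed to the auxiliary network branch; the Farneback estimator and all parameter values are illustrative assumptions.

```python
# Hypothetical sketch of the motion cues described in the abstract:
# dense optical flow + frame difference for the auxiliary branch of a
# double-stream crowd counting model. Not the authors' implementation.
import cv2
import numpy as np

def motion_cues(prev_frame: np.ndarray, curr_frame: np.ndarray):
    """Return (optical_flow, frame_difference) for two consecutive BGR frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)

    # Dense optical flow; the paper does not name its estimator, so Farneback
    # is used here purely as an example (args: pyr_scale, levels, winsize,
    # iterations, poly_n, poly_sigma, flags).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Frame difference highlights moving people against the static background,
    # which helps suppress cross-scene background noise.
    frame_diff = cv2.absdiff(curr_gray, prev_gray)
    return flow, frame_diff
```

The abstract also mentions randomly combining data transformations to simulate unseen scales and lighting. A hedged sketch of that idea follows; the transform choices and parameter ranges are assumptions rather than the paper's actual recipe.

```python
# Hypothetical augmentation sampler: draw a random subset of photometric and
# geometric transforms per training image to mimic unseen environments.
import random
import torchvision.transforms as T

def random_transform_pipeline(crop_size: int = 512) -> T.Compose:
    candidates = [
        T.ColorJitter(brightness=0.5, contrast=0.5),        # unstable exposure
        T.RandomResizedCrop(crop_size, scale=(0.5, 1.0)),   # varying scale
        T.RandomHorizontalFlip(p=1.0),
        T.GaussianBlur(kernel_size=5),
    ]
    chosen = random.sample(candidates, k=random.randint(1, len(candidates)))
    return T.Compose(chosen)
```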
Related papers
- SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior [53.52396082006044]
Current methods struggle to maintain rendering quality at viewpoints that deviate significantly from the training viewpoints.
This issue stems from the sparse training views captured by a fixed camera on a moving vehicle.
We propose a novel approach that enhances the capacity of 3DGS by leveraging a prior from a Diffusion Model.
arXiv Detail & Related papers (2024-03-29T09:20:29Z) - Amirkabir campus dataset: Real-world challenges and scenarios of Visual
Inertial Odometry (VIO) for visually impaired people [3.7998592843098336]
We introduce the Amirkabir campus dataset (AUT-VI) to address the mentioned problem and improve the navigation systems.
AUT-VI is a novel and super-challenging dataset with 126 diverse sequences in 17 different locations.
In support of ongoing development efforts, we have released the Android application for data capture to the public.
arXiv Detail & Related papers (2024-01-07T23:13:51Z) - SeaDSC: A video-based unsupervised method for dynamic scene change
detection in unmanned surface vehicles [3.2716252389196288]
This paper outlines our approach to detecting dynamic scene changes in Unmanned Surface Vehicles (USVs).
Our objective is to identify significant changes in the dynamic scenes of maritime video data, particularly those scenes that exhibit a high degree of resemblance.
In our system for dynamic scene change detection, we propose a completely unsupervised learning method.
arXiv Detail & Related papers (2023-11-20T07:34:01Z) - Towards Viewpoint Robustness in Bird's Eye View Segmentation [85.99907496019972]
We study how AV perception models are affected by changes in camera viewpoint.
Small changes to pitch, yaw, depth, or height of the camera at inference time lead to large drops in performance.
We introduce a technique for novel view synthesis and use it to transform collected data to the viewpoint of target rigs.
arXiv Detail & Related papers (2023-09-11T02:10:07Z) - DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z) - ScatterNeRF: Seeing Through Fog with Physically-Based Inverse Neural
Rendering [83.75284107397003]
We introduce ScatterNeRF, a neural rendering method which renders scenes and decomposes the fog-free background.
We propose a disentangled representation for the scattering volume and the scene objects, and learn the scene reconstruction with physics-inspired losses.
We validate our method by capturing multi-view In-the-Wild data and controlled captures in a large-scale fog chamber.
arXiv Detail & Related papers (2023-05-03T13:24:06Z) - SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking [12.447854608181833]
This work presents a novel saliency-guided dynamic vision Transformer (SGDViT) for UAV tracking.
The proposed method designs a new task-specific object saliency mining network to refine the cross-correlation operation.
A lightweight saliency filtering Transformer further refines saliency information and increases the focus on appearance information.
arXiv Detail & Related papers (2023-03-08T05:01:00Z) - UAV-CROWD: Violent and non-violent crowd activity simulator from the
perspective of UAV [0.0]
Video datasets that capture violent and non-violent human activity from aerial point-of-view are scarce.
We propose a novel, baseline simulator which is capable of generating synthetic images of crowds engaging in various activities that can be categorized as violent or non-violent.
Our simulator is capable of generating large, randomized urban environments and is able to maintain an average of 25 frames per second on a mid-range computer.
arXiv Detail & Related papers (2022-08-13T18:28:37Z) - Vision-Language Navigation with Random Environmental Mixup [112.94609558723518]
Vision-language Navigation (VLN) tasks require an agent to navigate step-by-step while perceiving the visual observations and comprehending a natural language instruction.
Previous works have proposed various data augmentation methods to reduce data bias.
We propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data by mixing up environments.
arXiv Detail & Related papers (2021-06-15T04:34:26Z)