3D Crowd Counting via Multi-View Fusion with 3D Gaussian Kernels
- URL: http://arxiv.org/abs/2003.08162v1
- Date: Wed, 18 Mar 2020 11:35:11 GMT
- Title: 3D Crowd Counting via Multi-View Fusion with 3D Gaussian Kernels
- Authors: Qi Zhang and Antoni B. Chan
- Abstract summary: An end-to-end multi-view crowd counting method called multi-view multi-scale (MVMS) was recently proposed.
Unlike MVMS, we propose to solve the crowd counting task through 3D feature fusion with 3D scene-level density maps, instead of the 2D ground-plane ones.
The proposed method is tested on three multi-view counting datasets and achieves counting performance better than or comparable to the state-of-the-art.
- Score: 56.964614522968226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowd counting has been studied for decades, and many works have
achieved good performance, especially DNN-based density map estimation methods.
Most existing crowd counting works focus on single-view counting, while few
have studied multi-view counting for large, wide scenes where multiple cameras
are used. Recently, an end-to-end multi-view crowd counting method called
multi-view multi-scale (MVMS) was proposed, which fuses multiple camera views
using a CNN to predict a 2D scene-level density map on the ground plane. Unlike
MVMS, we propose to solve the multi-view crowd counting task through 3D feature
fusion with 3D scene-level density maps, instead of the 2D ground-plane ones.
Compared to 2D fusion, 3D fusion extracts more information about people along
the z-dimension (height), which helps handle the scale variations across
multiple views. The 3D density maps preserve the 2D density maps' property that
the sum equals the count, while also providing 3D information about the crowd
density. We also exploit the projection consistency between the 3D prediction
and the ground truth in the 2D views to further enhance counting performance.
The proposed method is tested on three multi-view counting datasets and
achieves counting performance better than or comparable to the
state-of-the-art.
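The abstract's key property of the 3D density maps is that each person contributes a 3D Gaussian kernel that integrates to one, so summing the whole volume recovers the head count. The paper's own ground-truth construction is not reproduced here; the following is a minimal NumPy sketch of that idea, with the function name, voxel coordinates, and fixed `sigma` being illustrative assumptions:

```python
import numpy as np

def gaussian_3d_density_map(points, shape, sigma=2.0):
    """Place a normalized, truncated 3D Gaussian at each head position.

    points: (N, 3) array of (x, y, z) voxel coordinates (hypothetical layout).
    shape:  (X, Y, Z) size of the scene-level density volume.
    Because each kernel is renormalized to sum to 1, the sum of the
    resulting density map equals the number of annotated people.
    """
    density = np.zeros(shape, dtype=np.float64)
    radius = int(3 * sigma)  # truncate the kernel at 3 sigma
    # Precompute one truncated 3D Gaussian kernel.
    ax = np.arange(-radius, radius + 1)
    xx, yy, zz = np.meshgrid(ax, ax, ax, indexing="ij")
    kernel = np.exp(-(xx**2 + yy**2 + zz**2) / (2 * sigma**2))
    kernel /= kernel.sum()  # normalize: each person contributes exactly 1
    for x, y, z in points.astype(int):
        # Clip the kernel footprint to the volume boundaries.
        x0, x1 = max(x - radius, 0), min(x + radius + 1, shape[0])
        y0, y1 = max(y - radius, 0), min(y + radius + 1, shape[1])
        z0, z1 = max(z - radius, 0), min(z + radius + 1, shape[2])
        kx0, ky0, kz0 = x0 - (x - radius), y0 - (y - radius), z0 - (z - radius)
        patch = kernel[kx0:kx0 + (x1 - x0),
                       ky0:ky0 + (y1 - y0),
                       kz0:kz0 + (z1 - z0)]
        # Renormalize clipped kernels so the sum-is-count property still holds.
        density[x0:x1, y0:y1, z0:z1] += patch / patch.sum()
    return density
```

Note that summing such a volume along the z-axis (`density.sum(axis=2)`) yields a 2D ground-plane density map with the same total count, which is the relationship the abstract's projection-consistency idea builds on.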
Related papers
- MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z)
- Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving [26.03800936700545]
We propose to regulate intermediate dense 3D features with the help of volume rendering.
Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features.
arXiv Detail & Related papers (2023-12-19T04:09:05Z)
- SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose a SurroundOcc method to predict the 3D occupancy with multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z)
- MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z)
- Multi-View Transformer for 3D Visual Grounding [64.30493173825234]
We propose a Multi-View Transformer (MVT) for 3D visual grounding.
We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated.
arXiv Detail & Related papers (2022-04-05T12:59:43Z)
- Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in Large Scenes [50.744452135300115]
We propose a deep neural network framework for multi-view crowd counting.
Our methods achieve state-of-the-art results compared to other multi-view counting baselines.
arXiv Detail & Related papers (2020-12-02T03:20:30Z)
- Virtual Multi-view Fusion for 3D Semantic Segmentation [11.259694096475766]
We show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multi-view approaches.
When the 2D per-pixel predictions are aggregated on 3D surfaces, our virtual multi-view fusion method achieves significantly better 3D semantic segmentation results.
arXiv Detail & Related papers (2020-07-26T14:46:55Z)
- Light3DPose: Real-time Multi-Person 3D Pose Estimation from Multiple Views [5.510992382274774]
We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views.
Our architecture aggregates feature-maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene.
The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene.
arXiv Detail & Related papers (2020-04-06T14:12:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.