Related papers: PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments

URL: http://arxiv.org/abs/2502.15342v3
Date: Wed, 26 Feb 2025 11:11:45 GMT
Title: PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments
Authors: Yueting Liu, Hanshi Wang, Zhengjun Zha, Weiming Hu, Jin Gao
Abstract summary: We present the multi-modal Pedestrian-Focused Scene dataset, rigorously annotated in semi-structured scenes with the format of nuScenes. We also propose a novel Hybrid Multi-Scale Fusion Network (HMFN) to detect pedestrians in densely populated and occluded scenarios.
Score: 73.80718037070773
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in autonomous driving perception have revealed exceptional capabilities within structured environments dominated by vehicular traffic. However, current perception models exhibit significant limitations in semi-structured environments, where dynamic pedestrians with more diverse, irregular movement and occlusion prevail. We attribute this shortcoming to the scarcity of high-quality datasets in semi-structured scenes, particularly concerning pedestrian perception and prediction. In this work, we present the multi-modal Pedestrian-Focused Scene Dataset (PFSD), rigorously annotated in semi-structured scenes with the format of nuScenes. PFSD provides comprehensive multi-modal data annotations with point cloud segmentation, detection, and object IDs for tracking. It encompasses over 130,000 pedestrian instances captured across various scenarios with varying densities, movement patterns, and occlusions. Furthermore, to demonstrate the importance of addressing the challenges posed by more diverse and complex semi-structured environments, we propose a novel Hybrid Multi-Scale Fusion Network (HMFN). Specifically, to detect pedestrians in densely populated and occluded scenarios, our method effectively captures and fuses multi-scale features using a meticulously designed hybrid framework that integrates sparse and vanilla convolutions. Extensive experiments on PFSD demonstrate that HMFN attains an improvement in mean Average Precision (mAP) over existing methods, thereby underscoring its efficacy in addressing the challenges of 3D pedestrian detection in complex semi-structured environments. Code and benchmark are available.

Related papers

MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors [4.4714079610450765]
MultiEditor is a dual-branch latent diffusion framework designed to edit images and LiDAR point clouds jointly. We propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities. Experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency.
arXiv Detail & Related papers (2025-07-29T14:42:52Z)
What You Have is What You Track: Adaptive and Robust Multimodal Tracking [72.92244578461869]
We present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings.
arXiv Detail & Related papers (2025-07-08T11:40:21Z)
STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data [4.351581973358463]
The Transformer-based approach, STaRFormer, serves as a universal framework for sequential modeling. STaRFormer employs a novel, dynamic attention-based regional masking scheme combined with semi-supervised contrastive learning to enhance task-specific latent representations.
arXiv Detail & Related papers (2025-04-14T11:03:19Z)
LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds [67.96583737413296]
We propose an enhanced classification method based on adaptive multi-scale fusion for MPCs with long-tailed distributions. In the training set generation stage, a grid-balanced sampling strategy is designed to reliably generate training samples from sparse labeled datasets. In the feature learning stage, a multi-scale feature fusion module is proposed to fuse shallow features of land-covers at different scales.
arXiv Detail & Related papers (2024-12-16T03:21:20Z)
Semantic Scene Completion Based 3D Traversability Estimation for Off-Road Terrains [10.521569910467072]
Off-road environments present significant challenges for autonomous ground vehicles. Traditional perception algorithms, designed primarily for structured environments, often fail under these conditions. In this paper, ORDformer is proposed to generate dense traversable occupancy predictions from a forward-facing perspective.
arXiv Detail & Related papers (2024-12-11T08:36:36Z)
One for All: Multi-Domain Joint Training for Point Cloud Based 3D Object Detection [71.78795573911512]
We propose OneDet3D, a universal one-for-all model that addresses 3D detection across different domains. We propose the domain-aware in scatter and context, guided by a routing mechanism, to address the data interference issue. The fully sparse structure and anchor-free head further accommodate point clouds with significant scale disparities.
arXiv Detail & Related papers (2024-11-03T14:21:56Z)
Uni$^2$Det: Unified and Universal Framework for Prompt-Guided Multi-dataset 3D Detection [64.08296187555095]
Uni$^2$Det is a framework for unified and universal multi-dataset training on 3D detection. We introduce multi-stage prompting modules for multi-dataset 3D detection. Results on zero-shot cross-dataset transfer validate the generalization capability of our proposed method.
arXiv Detail & Related papers (2024-09-30T17:57:50Z)
Multimodal Collaboration Networks for Geospatial Vehicle Detection in Dense, Occluded, and Large-Scale Events [29.86323896541765]
In large-scale disaster events, the planning of optimal rescue routes depends on the object detection ability at the disaster scene. Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments. We propose a multimodal collaboration network for dense and occluded vehicle detection, MuDet.
arXiv Detail & Related papers (2024-05-14T00:51:15Z)
Let-It-Flow: Simultaneous Optimization of 3D Flow and Object Clustering [2.763111962660262]
We study the problem of self-supervised 3D scene flow estimation from real large-scale raw point cloud sequences. We propose a novel clustering approach that allows for combination of overlapping soft clusters as well as non-overlapping rigid clusters. Our method especially excels in resolving flow in complicated dynamic scenes with multiple independently moving objects close to each other.
arXiv Detail & Related papers (2024-04-12T10:04:03Z)
STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes [78.95447086305381]
Accurately detecting and tracking pedestrians in 3D space is challenging due to large variations in rotations, poses and scales. Existing benchmarks either only provide 2D annotations, or have limited 3D annotations with low-density pedestrian distribution. We introduce a large-scale multimodal dataset, STCrowd, to better evaluate pedestrian perception algorithms in crowded scenarios.
arXiv Detail & Related papers (2022-04-03T08:26:07Z)
PSE-Match: A Viewpoint-free Place Recognition Method with Parallel Semantic Embedding [9.265785042748158]
PSE-Match is a viewpoint-free place recognition method based on parallel semantic analysis of isolated semantic attributes from 3D point-cloud models. PSE-Match incorporates a divergence place learning network to capture different semantic attributes parallelly through the spherical harmonics domain.
arXiv Detail & Related papers (2021-08-01T22:16:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.