A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion
- URL: http://arxiv.org/abs/2601.20847v2
- Date: Thu, 29 Jan 2026 13:26:18 GMT
- Title: A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion
- Authors: Willams de Lima Costa, Thifany Ketuli Silva de Souza, Jonas Ferreira Silva, Carlos Gabriel Bezerra Pereira, Bruno Reis Vila Nova, Leonardo Silvino Brito, Rafael Raider Leoni, Juliano Silva Filho, Valter Ferreira, Sibele Miguel Soares Neto, Samantha Uehara, Daniel Giacometti Amaral, João Marcelo Teixeira, Veronica Teichrieb, Cristiano Coelho de Araújo,
- Abstract summary: Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems.<n>Existing RSC techniques often fail to generalize beyond narrow operational conditions.<n>This work introduces a multimodal framework that fuses images and inertial measurements.
- Score: 3.571515153090507
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.
Related papers
- DSFC-Net: A Dual-Encoder Spatial and Frequency Co-Awareness Network for Rural Road Extraction [32.51260718935461]
We propose DSFC-Net, a dual-encoder framework that fuses spatial and frequency-domain information.<n>CFIA module explicitly decouples high- and low-frequency information via a Laplacian Pyramid strategy.<n>Experiments on the WHU-RuR+, DeepGlobe, and Massachusetts datasets validate the superiority of DSFC-Net over state-of-the-art approaches.
arXiv Detail & Related papers (2026-02-01T15:23:42Z) - Adaptive Attention Distillation for Robust Few-Shot Segmentation under Environmental Perturbations [43.30169413561605]
Few-shot segmentation (FSS) aims to rapidly learn novel class concepts from limited examples to segment specific targets in unseen images.<n>Existing studies largely overlook the complex environmental factors encountered in real world scenarios.<n>This paper introduces an environment-robust FSS setting that explicitly incorporates challenging test cases arising from complex environments.
arXiv Detail & Related papers (2026-01-07T05:27:12Z) - A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations [2.312232949770907]
Rolling-element bearings are among the most frequent causes of machinery failure.<n>Rolling-element bearings are among the most frequent causes of machinery failure.<n>Existing approaches often suffer from poor generalization, lack of robustness, high data demands, and limited interpretability.
arXiv Detail & Related papers (2025-12-07T07:38:36Z) - Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining [95.00432497331583]
Multi-Prior Hierarchical Mamba (MPHM) network for image deraining.<n>MPHM integrates macro-semantic textual priors (CLIP) for task-level semantic guidance and micro-structural visual priors (DINOv2) for scene-aware structural information.<n>Experiments demonstrate MPHM's state-of-the-art performance, achieving a 0.57 dB PSNR gain on the Rain200H dataset.
arXiv Detail & Related papers (2025-11-17T08:08:59Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities.<n>Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark.<n>We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - DFYP: A Dynamic Fusion Framework with Spectral Channel Attention and Adaptive Operator learning for Crop Yield Prediction [18.24061967822792]
DFYP is a novel Dynamic Fusion framework for crop Yield Prediction.<n>It combines spectral channel attention, edge-adaptive spatial modeling and a learnable fusion mechanism.<n> DFYP consistently outperforms current state-of-the-art baselines in RMSE, MAE, and R2.
arXiv Detail & Related papers (2025-07-08T10:24:04Z) - Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [73.75271615101754]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.<n>Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.<n>Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments [73.80718037070773]
We present the multi-modal Pedestrian-Focused Scene dataset, rigorously annotated in semi-structured scenes with the format of nuScenes.<n>We also propose a novel Hybrid Multi-Scale Fusion Network (HMFN) to detect pedestrians in densely populated and occluded scenarios.
arXiv Detail & Related papers (2025-02-21T09:57:53Z) - Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.<n>It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL)<n>It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images.<n>Traditional encoder-decoder architectures may not have adequately considered the robustness against noise originating from defective modalities.<n>We propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.