OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving
- URL: http://arxiv.org/abs/2512.14225v1
- Date: Tue, 16 Dec 2025 09:18:15 GMT
- Title: OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving
- Authors: Tao Tang, Enhui Ma, Xia Zhou, Letian Wang, Tianyi Yan, Xueyang Zhang, Kun Zhan, Peng Jia, XianPeng Lang, Jia-Wang Bian, Kaicheng Yu, Xiaodan Liang
- Abstract summary: We propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction.
- Score: 58.693329943871355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving has seen remarkable advancements, largely driven by extensive real-world data collection. However, acquiring diverse and corner-case data remains costly and inefficient. Generative models have emerged as a promising solution by synthesizing realistic sensor data. However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. Our approach leverages a shared Bird's Eye View (BEV) space to unify multimodal features and designs a novel generalizable multimodal reconstruction method, UAE, to jointly decode LiDAR and multi-view camera data. UAE achieves multimodal sensor decoding through volume rendering, enabling accurate and flexible reconstruction. Furthermore, we incorporate a Diffusion Transformer (DiT) with a ControlNet branch to enable controllable multimodal sensor generation. Our comprehensive experiments demonstrate that OmniGen achieves the desired performance in unified multimodal sensor data generation, with multimodal consistency and flexible sensor adjustments.
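To make the pipeline described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of the core idea: a single shared BEV latent is decoded into both a LiDAR-style output and multi-view camera outputs. This is not the authors' implementation; the module names (SharedBEVEncoder, VolumeRenderingDecoder), tensor shapes, number of cameras, and the simple convolutional heads standing in for the UAE volume-rendering decoder and the DiT/ControlNet generator are all illustrative assumptions.

```python
# Illustrative sketch only: module names, shapes, and the decoding scheme are
# assumptions based on the abstract, not the released OmniGen implementation.
import torch
import torch.nn as nn


class SharedBEVEncoder(nn.Module):
    """Stand-in for the shared Bird's Eye View feature space.

    In the paper this latent would be produced by a DiT with a ControlNet
    branch; here a learned parameter takes its place.
    """

    def __init__(self, bev_channels=64, bev_size=(128, 128)):
        super().__init__()
        self.bev_latent = nn.Parameter(torch.randn(1, bev_channels, *bev_size))

    def forward(self, batch_size=1):
        # Broadcast the single learned BEV grid across the batch.
        return self.bev_latent.expand(batch_size, -1, -1, -1)


class VolumeRenderingDecoder(nn.Module):
    """Stand-in for UAE: decode the shared BEV feature into per-sensor outputs.

    The abstract says UAE decodes LiDAR and multi-view cameras through volume
    rendering; simple 1x1 conv heads mimic only the interface here.
    """

    def __init__(self, bev_channels=64, num_cams=6):
        super().__init__()
        self.num_cams = num_cams
        self.lidar_head = nn.Conv2d(bev_channels, 1, kernel_size=1)            # pseudo range/occupancy map
        self.cam_head = nn.Conv2d(bev_channels, 3 * num_cams, kernel_size=1)    # pseudo multi-view RGB

    def forward(self, bev):
        lidar = self.lidar_head(bev)                                  # (B, 1, H, W)
        cams = self.cam_head(bev)                                     # (B, 3*num_cams, H, W)
        cams = cams.reshape(bev.shape[0], self.num_cams, 3, *bev.shape[-2:])
        return lidar, cams


if __name__ == "__main__":
    encoder, decoder = SharedBEVEncoder(), VolumeRenderingDecoder()
    bev = encoder(batch_size=2)
    lidar, cams = decoder(bev)
    print(lidar.shape, cams.shape)  # torch.Size([2, 1, 128, 128]) torch.Size([2, 6, 3, 128, 128])
```

The point the sketch tries to convey is that both modalities are read out from one BEV representation, which is how the abstract frames cross-sensor alignment: generating in the shared space first and decoding per sensor afterwards keeps LiDAR and camera outputs consistent by construction.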
Related papers
- UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving [34.278528623978204]
UniDriveDreamer is a single-stage unified multimodal world model for autonomous driving. It generates multimodal future observations without relying on intermediate representations or cascaded modules. It outperforms previous state-of-the-art methods in both video and LiDAR generation.
arXiv Detail & Related papers (2026-02-02T12:02:27Z) - UniMPR: A Unified Framework for Multimodal Place Recognition with Heterogeneous Sensor Configurations [14.975915291012983]
We propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-20T09:01:18Z) - A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles [74.8162337823142]
MM-UAV is the first large-scale benchmark for Multi-Modal UAV Tracking. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework.
arXiv Detail & Related papers (2025-11-23T08:42:17Z) - MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion [2.7745600113170994]
We introduce the MultiSensor-Home dataset, a novel benchmark for comprehensive action recognition in home environments. We also propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method.
arXiv Detail & Related papers (2025-04-03T05:23:08Z) - MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition [2.7745600113170994]
Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. We propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method. It leverages a Transformer-based architecture to dynamically model inter-view relationships and capture temporal dependencies across multiple views.
arXiv Detail & Related papers (2025-04-03T05:04:05Z) - Multi-modal Multi-platform Person Re-Identification: Benchmark and Method [58.59888754340054]
MP-ReID is a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging. We introduce Uni-Prompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios.
arXiv Detail & Related papers (2025-03-21T12:27:49Z) - Graph-Based Multi-Modal Sensor Fusion for Autonomous Driving [3.770103075126785]
We introduce a novel approach to multi-modal sensor fusion, focusing on developing a graph-based state representation.
We present a Sensor-Agnostic Graph-Aware Kalman Filter, the first online state estimation technique designed to fuse multi-modal graphs.
We validate the effectiveness of our proposed framework through extensive experiments conducted on both synthetic and real-world driving datasets.
arXiv Detail & Related papers (2024-11-06T06:58:17Z) - CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes [56.52618054240197]
We propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token. Our model significantly improves robustness and accuracy, especially in adverse-condition scenarios.
arXiv Detail & Related papers (2024-10-14T17:56:20Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - SensiX++: Bringing MLOps and Multi-tenant Model Serving to Sensory Edge Devices [69.1412199244903]
We present a multi-tenant runtime for adaptive model execution with integrated MLOps on edge devices, e.g., a camera, a microphone, or IoT sensors.
SensiX++ operates on two fundamental principles: highly modular componentisation to externalise data operations with clear abstractions, and document-centric manifestation for system-wide orchestration.
We report on the overall throughput and quantified benefits of various automation components of SensiX++ and demonstrate its efficacy in significantly reducing operational complexity and lowering the effort to deploy, upgrade, reconfigure, and serve embedded models on edge devices.
arXiv Detail & Related papers (2021-09-08T22:06:16Z)