Self-Supervised Representation Learning with Joint Embedding Predictive Architecture for Automotive LiDAR Object Detection
- URL: http://arxiv.org/abs/2501.04969v2
- Date: Tue, 07 Oct 2025 02:07:45 GMT
- Title: Self-Supervised Representation Learning with Joint Embedding Predictive Architecture for Automotive LiDAR Object Detection
- Authors: Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Beiyao Sha, Anna Choromanska
- Abstract summary: We present AD-L-JEPA, a novel self-supervised pre-training framework for autonomous driving. Unlike existing methods, AD-L-JEPA is neither generative nor contrastive. It offers higher-quality, faster, and more GPU-memory-efficient self-supervised representation learning.
- Score: 10.19369242630191
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, self-supervised representation learning relying on vast amounts of unlabeled data has been explored as a pre-training method for autonomous driving. However, directly applying popular contrastive or generative methods to this problem is insufficient and may even lead to negative transfer. In this paper, we present AD-L-JEPA, a novel self-supervised pre-training framework with a joint embedding predictive architecture (JEPA) for automotive LiDAR object detection. Unlike existing methods, AD-L-JEPA is neither generative nor contrastive. Instead of explicitly generating masked regions, our method predicts Bird's-Eye-View embeddings to capture the diverse nature of driving scenes. Furthermore, our approach eliminates the need to manually form contrastive pairs by employing explicit variance regularization to avoid representation collapse. Experimental results demonstrate consistent improvements on the LiDAR 3D object detection downstream task across the KITTI3D, Waymo, and ONCE datasets, while reducing GPU hours by 1.9x-2.7x and GPU memory by 2.8x-4x compared with the state-of-the-art method Occupancy-MAE. Notably, on the largest ONCE dataset, pre-training on 100K frames yields a 1.61 mAP gain, better than all other methods pre-trained on either 100K or 500K frames, and pre-training on 500K frames yields a 2.98 mAP gain, better than all other methods pre-trained on either 500K or 1M frames. AD-L-JEPA constitutes the first JEPA-based pre-training method for autonomous driving. It offers higher-quality, faster, and more GPU-memory-efficient self-supervised representation learning. The source code of AD-L-JEPA is ready to be released.
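As a rough, illustrative sketch of the pre-training objective described above (predicting BEV embeddings plus explicit variance regularization), the following PyTorch snippet shows one way such a JEPA-style loss can be assembled. All module choices, shapes, and hyperparameters are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVJEPA(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Stand-ins for LiDAR BEV encoders; a real model would use a sparse-conv backbone.
        self.context_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.target_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Target encoder starts as a copy of the context encoder and is updated by EMA only.
        self.target_encoder.load_state_dict(self.context_encoder.state_dict())
        for p in self.target_encoder.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def ema_update(self, momentum=0.996):
        for pc, pt in zip(self.context_encoder.parameters(), self.target_encoder.parameters()):
            pt.mul_(momentum).add_(pc.detach(), alpha=1.0 - momentum)

    def forward(self, context_feats, target_feats, var_weight=1.0, eps=1e-4):
        # context_feats / target_feats: (B, N_cells, dim) BEV cell features from the
        # visible (context) and held-out (target) regions of the scene.
        z_pred = self.predictor(self.context_encoder(context_feats))
        with torch.no_grad():
            z_tgt = self.target_encoder(target_feats)
        pred_loss = F.smooth_l1_loss(z_pred, z_tgt)
        # Explicit variance regularization (hinge on per-dimension std) to avoid collapse.
        std = torch.sqrt(z_pred.flatten(0, 1).var(dim=0) + eps)
        var_loss = F.relu(1.0 - std).mean()
        return pred_loss + var_weight * var_loss
```

The variance hinge plays the role the abstract attributes to explicit variance regularization: it keeps the per-dimension spread of the predicted embeddings above a floor, so the predictor cannot satisfy the loss with a constant output.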
Related papers
- CrossJEPA: Cross-Modal Joint-Embedding Predictive Architecture for Efficient 3D Representation Learning from 2D Images [1.20952748584685]
Cross-modal learning has emerged to address the scarcity of large-scale 3D datasets in 3D representation learning. We propose CrossJEPA, a simple Cross-Modal Joint Embedding Predictive Architecture that harnesses the knowledge of an image foundation model and trains a predictor to infer embeddings of specific rendered 2D views from corresponding 3D point clouds.
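A minimal sketch of the cross-modal prediction idea summarized above, with the image foundation model kept frozen; the interfaces and the cosine-regression loss are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def crossjepa_loss(point_encoder: nn.Module,
                   predictor: nn.Module,
                   frozen_image_encoder: nn.Module,
                   points: torch.Tensor,         # (B, N_points, 3) point clouds
                   rendered_views: torch.Tensor  # (B, 3, H, W) views rendered from the same shapes
                   ) -> torch.Tensor:
    with torch.no_grad():                         # the image foundation model is frozen
        target_emb = frozen_image_encoder(rendered_views)        # (B, D) target embeddings
    pred_emb = predictor(point_encoder(points))                   # (B, D) predicted embeddings
    # Cosine regression onto the frozen image embedding (a common choice; an assumption here).
    return 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()
```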
arXiv Detail & Related papers (2025-11-23T12:40:04Z) - LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics [53.247652209132376]
Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but the lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective.
arXiv Detail & Related papers (2025-11-11T18:21:55Z) - VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process [40.3578745624081]
We propose a vision-language autonomous driving model that integrates a fine-tuned Visual Language Model (VLM) with a state-of-the-art end-to-end system. We implement a specialized fine-tuning approach using custom question-answer datasets designed specifically to improve the spatial reasoning capabilities of the model. Our system produces interpretable natural language explanations of driving decisions, thereby increasing the transparency and trustworthiness of the traditionally black-box end-to-end architecture.
arXiv Detail & Related papers (2025-07-02T01:52:40Z) - Generative AI for Autonomous Driving: Frontiers and Opportunities [145.6465312554513]
This survey delivers a comprehensive synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models. We then categorize practical applications, such as synthetic data generation, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI.
arXiv Detail & Related papers (2025-05-13T17:59:20Z) - Efficient Adversarial Detection Frameworks for Vehicle-to-Microgrid Services in Edge Computing [6.75253870287079]
Malicious actors exploit vulnerabilities in Machine Learning algorithms to disrupt power generation and distribution. We propose a novel strategy that optimizes detection models for Vehicle-to-Microgrid (V2M) edge environments. Our approach integrates model design and compression into a unified process and results in a highly compact detection model.
arXiv Detail & Related papers (2025-03-25T03:26:49Z) - DiffAD: A Unified Diffusion Modeling Approach for Autonomous Driving [17.939192289319056]
We introduce DiffAD, a novel diffusion probabilistic model that redefines autonomous driving as a conditional image generation task.
By rasterizing heterogeneous targets onto a unified bird's-eye view (BEV) and modeling their latent distribution, DiffAD unifies various driving objectives.
The reverse process iteratively refines the generated BEV image, resulting in more robust and realistic driving behaviors.
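For intuition, a minimal DDPM-style reverse step is sketched below to illustrate how iterative refinement of a generated BEV image proceeds; the noise schedule, conditioning, and denoiser are placeholders rather than DiffAD's actual design.

```python
import torch

def ddpm_reverse_step(x_t, t, eps_model, betas):
    # x_t: noisy BEV image (B, C, H, W); t: integer timestep; eps_model predicts the added noise.
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = eps_model(x_t, t)                                    # predicted noise (placeholder network)
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    if t > 0:
        return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
    return mean                                                # final refined BEV image at t = 0

# Full sampling simply applies this step from t = T-1 down to t = 0,
# starting from Gaussian noise conditioned on the driving context.
```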
arXiv Detail & Related papers (2025-03-15T15:23:35Z) - SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
Multimodal Large Language Models (MLLMs) can process both visual and textual data.
We propose SafeAuto, a novel framework that enhances MLLM-based autonomous driving systems by incorporating both unstructured and structured knowledge.
arXiv Detail & Related papers (2025-02-28T21:53:47Z) - The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey [50.62538723793247]
Driving World Model (DWM) focuses on predicting scene evolution during the driving process.
DWM methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments.
arXiv Detail & Related papers (2025-02-14T18:43:15Z) - TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning [61.33599727106222]
TeLL-Drive is a hybrid framework that integrates a Teacher LLM to guide an attention-based Student DRL policy.
A self-attention mechanism then fuses these strategies with the DRL agent's exploration, accelerating policy convergence and boosting robustness.
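One plausible reading of the attention-based fusion described above is sketched below; the dimensions, the single attention layer, and the token layout are assumptions for illustration rather than TeLL-Drive's actual architecture.

```python
import torch
import torch.nn as nn

class StrategyFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, agent_state, teacher_strategies):
        # agent_state: (B, 1, dim) DRL agent features; teacher_strategies: (B, K, dim)
        # embeddings of the Teacher LLM's suggested strategies.
        tokens = torch.cat([agent_state, teacher_strategies], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)   # self-attention over all tokens
        return fused[:, 0]                             # fused agent representation, shape (B, dim)
```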
arXiv Detail & Related papers (2025-02-03T14:22:03Z) - OWLed: Outlier-weighed Layerwise Pruning for Efficient Autonomous Driving Framework [3.8320050452121692]
We introduce OWLed, the Outlier-Weighed Layerwise Pruning for Efficient Autonomous Driving Framework. Our method assigns non-uniform sparsity ratios to different layers based on the distribution of outlier features. To ensure the compressed model adapts well to autonomous driving tasks, we incorporate driving environment data into both the calibration and pruning processes.
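A toy sketch of outlier-weighed sparsity allocation consistent with the summary above; the outlier threshold and the mapping from outlier ratio to per-layer sparsity are assumptions, not the paper's exact rule.

```python
import torch

def layer_outlier_ratio(acts: torch.Tensor, m: float = 5.0) -> float:
    # acts: calibration activations for one layer, shape (num_tokens, hidden_dim).
    mag = acts.abs()
    return (mag > m * mag.mean()).float().mean().item()   # fraction of outlier features

def allocate_sparsity(outlier_ratios, target_sparsity=0.5, spread=0.2):
    # Shift each layer's sparsity away from the global target in proportion to how its
    # outlier ratio compares with the other layers: more outliers -> lower sparsity.
    r = torch.tensor(outlier_ratios)
    centered = (r - r.mean()) / (r.max() - r.min() + 1e-8)
    per_layer = target_sparsity - spread * centered
    return per_layer.clamp(0.0, 0.95).tolist()
```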
arXiv Detail & Related papers (2024-11-12T10:55:30Z) - Self-Updating Vehicle Monitoring Framework Employing Distributed Acoustic Sensing towards Real-World Settings [5.306938463648908]
We introduce a real-time semi-supervised vehicle monitoring framework tailored to urban settings.
It requires only a small fraction of manual labels for initial training and exploits unlabeled data for model improvement.
We propose a novel prior loss that incorporates the shapes of vehicular traces to track a single vehicle with varying speeds.
arXiv Detail & Related papers (2024-09-16T13:10:58Z) - EditFollower: Tunable Car Following Models for Customizable Adaptive Cruise Control Systems [28.263763430300504]
We propose a data-driven car-following model that allows for adjusting driving discourtesy levels.
Our model provides valuable insights for the development of ACC systems that take into account drivers' social preferences.
arXiv Detail & Related papers (2024-06-23T15:04:07Z) - DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents.
We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator.
We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z) - AD-H: Autonomous Driving with Hierarchical Agents [64.49185157446297]
We propose to connect high-level instructions and low-level control signals with mid-level language-driven commands.
We implement this idea through a hierarchical multi-agent driving system named AD-H.
arXiv Detail & Related papers (2024-06-05T17:25:46Z) - Foundation Models for Structural Health Monitoring [14.36493796970864]
We propose for the first time the use of Transformer neural networks, with a Masked Auto-Encoder architecture, as Foundation Models for Structural Health Monitoring. We demonstrate the ability of these models to learn generalizable representations from multiple large datasets through self-supervised pre-training. We showcase the effectiveness of our foundation models using data from three operational viaducts.
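A minimal sketch of masked auto-encoder style pre-training as summarized above; the mask ratio, token layout, and decoder interface are assumptions used only to illustrate the mechanism.

```python
import torch
import torch.nn.functional as F

def mae_step(encoder, decoder, tokens, mask_ratio=0.75):
    # tokens: (B, L, D) patch/window embeddings of the raw monitoring signals.
    B, L, D = tokens.shape
    num_keep = max(1, int(L * (1 - mask_ratio)))
    perm = torch.rand(B, L, device=tokens.device).argsort(dim=1)
    keep_idx = perm[:, :num_keep]                                     # indices of visible tokens
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                                         # encode visible tokens only
    recon = decoder(latent, keep_idx, L)      # placeholder decoder: fills masked slots, returns (B, L, D)
    masked = torch.ones(B, L, dtype=torch.bool, device=tokens.device)
    masked.scatter_(1, keep_idx, False)                               # True where tokens were hidden
    return F.mse_loss(recon[masked], tokens[masked])                  # reconstruct the masked tokens
```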
arXiv Detail & Related papers (2024-04-03T13:32:44Z) - Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach [13.513005108086006]
We propose an efficient BEV-based 3D detection framework called BEVENet.
BEVENet is 3x faster than contemporary state-of-the-art (SOTA) approaches on the NuScenes challenge.
arXiv Detail & Related papers (2023-12-01T14:52:59Z) - Applications of Large Scale Foundation Models for Autonomous Driving [22.651585322658686]
Large language models (LLMs) and chat systems, such as ChatGPT and PaLM, have emerged and rapidly become a promising direction for achieving artificial general intelligence (AGI) in natural language processing (NLP).
In this paper, we investigate the techniques of foundation models and LLMs applied to autonomous driving, categorized into simulation, world models, data annotation, and planning or end-to-end (E2E) solutions.
arXiv Detail & Related papers (2023-11-20T19:45:27Z) - LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers.
In this paper, we systematically review a research line about Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z) - Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatially quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
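A rough sketch of how spatially quantized historical features from past traversals could be built and fused; the grid resolution, per-cell statistics, and concatenation-based fusion are assumptions, not the paper's exact recipe.

```python
import torch

def quantize_history(points_xyz, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
    # points_xyz: (N, 3) accumulated LiDAR points from prior traversals, in a common map frame.
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    ix = ((points_xyz[:, 0] - x_range[0]) / cell).long().clamp(0, nx - 1)
    iy = ((points_xyz[:, 1] - y_range[0]) / cell).long().clamp(0, ny - 1)
    hist = torch.zeros(2, nx, ny)
    flat = ix * ny + iy
    # Channel 0: per-cell point counts; channel 1: per-cell max height.
    hist[0].view(-1).scatter_add_(0, flat, torch.ones_like(flat, dtype=torch.float))
    hist[1].view(-1).scatter_reduce_(0, flat, points_xyz[:, 2], reduce="amax", include_self=False)
    return hist  # concatenate with the live BEV features along the channel dimension
```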
arXiv Detail & Related papers (2023-09-21T15:00:31Z) - Integrated Decision and Control for High-Level Automated Vehicles by Mixed Policy Gradient and Its Experiment Verification [10.393343763237452]
This paper presents a self-evolving decision-making system based on the Integrated Decision and Control (IDC) framework.
An RL algorithm called constrained mixed policy gradient (CMPG) is proposed to consistently upgrade the driving policy of the IDC.
Experimental results show that, boosted by data, the system achieves better driving ability than model-based methods.
arXiv Detail & Related papers (2022-10-19T14:58:41Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
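The length re-scaling idea can be illustrated in a few lines; the specific rule of matching the mean norm of the pretrained base-class weights is an assumption rather than the paper's exact ALR formulation.

```python
import torch

def rescale_novel_weights(base_w: torch.Tensor, novel_w: torch.Tensor) -> torch.Tensor:
    # base_w: (num_base, D) pretrained classifier weights; novel_w: (num_novel, D) predicted weights.
    base_norm = base_w.norm(dim=1).mean()
    novel_norm = novel_w.norm(dim=1, keepdim=True).clamp_min(1e-8)
    # Normalize each predicted vector, then match the base classes' mean length so that
    # novel-class logits are not systematically smaller than base-class logits.
    return novel_w / novel_norm * base_norm
```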
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving [94.11868795445798]
We release a Large-Scale Object Detection benchmark for Autonomous driving, named SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories.
To improve diversity, the images are collected at one frame every ten seconds across 32 different cities under different weather conditions, periods, and location scenes.
We provide extensive experiments and deep analyses of existing supervised state-of-the-art detection models, popular self-supervised and semi-supervised approaches, and some insights about how to develop future models.
arXiv Detail & Related papers (2021-06-21T13:55:57Z) - End-to-End Semi-Supervised Object Detection with Soft Teacher [63.26266730447914]
This paper presents an end-to-end semi-supervised object detection approach, in contrast to previous more complex multi-stage methods.
The proposed approach outperforms previous methods by a large margin under various labeling ratios.
On the state-of-the-art Swin Transformer-based object detector, it can still significantly improve the detection accuracy by +1.5 mAP.
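A generic teacher-student pseudo-labeling sketch in the spirit of semi-supervised detection with a soft teacher; the detector interfaces, confidence threshold, and weighting scheme are hypothetical simplifications, not the paper's exact formulation.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(teacher, images, score_thresh=0.9):
    # teacher: an EMA copy of the student detector; interface below is hypothetical.
    boxes, scores, labels = teacher(images)
    keep = scores > score_thresh                      # keep only high-confidence detections
    return boxes[keep], labels[keep], scores[keep]

def unsupervised_loss(student, images_strong_aug, pseudo_boxes, pseudo_labels, weights):
    # student(...) returning per-box classification and regression losses is hypothetical.
    cls_loss, reg_loss = student(images_strong_aug, pseudo_boxes, pseudo_labels)
    # Weight the classification term by teacher confidence ("soft" supervision).
    return (weights * cls_loss).mean() + reg_loss.mean()
```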
arXiv Detail & Related papers (2021-06-16T17:59:30Z)