Exploring Camera Encoder Designs for Autonomous Driving Perception
- URL: http://arxiv.org/abs/2407.07276v1
- Date: Tue, 9 Jul 2024 23:44:58 GMT
- Title: Exploring Camera Encoder Designs for Autonomous Driving Perception
- Authors: Barath Lakshmanan, Joshua Chen, Shiyi Lan, Maying Shen, Zhiding Yu, Jose M. Alvarez
- Abstract summary: We develop an architecture optimized for the AV camera encoder, achieving an 8.79% mAP improvement over the baseline.
We believe our effort could become a sweet cookbook of image encoders for AV and pave the way to next-level driving systems.
- Score: 36.65794720685284
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The cornerstone of autonomous vehicles (AV) is a solid perception system, where camera encoders play a crucial role. Existing works usually leverage pre-trained Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) designed for general vision tasks, such as image classification, segmentation, and 2D detection. Although these well-known architectures have achieved state-of-the-art accuracy in AV-related tasks, e.g., 3D object detection, there remains significant potential for improvement in network design due to the nuanced complexities of industrial-level AV datasets. Moreover, existing public AV benchmarks usually contain insufficient data, which might lead to inaccurate evaluation of these architectures. To reveal AV-specific model insights, we start from a standard general-purpose encoder, ConvNeXt, and progressively transform its design. We adjust different design parameters, including the width and depth of the model, the stage compute ratio, attention mechanisms, and input resolution, supported by a systematic analysis of each modification. This customization yields an architecture optimized for AV camera encoding that achieves an 8.79% mAP improvement over the baseline. We believe our effort could become a sweet cookbook of image encoders for AV and pave the way to next-level driving systems.
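To make the explored design axes concrete, below is a minimal PyTorch-style sketch of a ConvNeXt-like encoder whose width, depth, and stage compute ratio are exposed as configuration. All names, defaults, and the block layout are illustrative assumptions drawn from the public ConvNeXt design, not the paper's actual architecture or training setup.

```python
# Minimal sketch of the design axes the paper explores: model width, depth,
# and stage compute ratio of a ConvNeXt-style encoder. Names, defaults, and
# the block layout are assumptions based on the public ConvNeXt design.
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 conv -> LayerNorm -> pointwise MLP, as in ConvNeXt."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted-bottleneck expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return residual + x

class CameraEncoder(nn.Module):
    """Configurable encoder: `widths` sets per-stage channels (model width),
    `depths` sets blocks per stage (depth and stage compute ratio)."""
    def __init__(self, widths=(96, 192, 384, 768), depths=(3, 3, 9, 3)):
        super().__init__()
        self.stem = nn.Conv2d(3, widths[0], kernel_size=4, stride=4)
        stages = []
        for i, (w, d) in enumerate(zip(widths, depths)):
            if i > 0:  # downsample between stages
                stages.append(nn.Conv2d(widths[i - 1], w, kernel_size=2, stride=2))
            stages.extend(ConvNeXtBlock(w) for _ in range(d))
        self.stages = nn.Sequential(*stages)

    def forward(self, x):
        return self.stages(self.stem(x))

# Example: shift compute toward the high-resolution early stages, one of the
# stage-compute-ratio variations such a study would cover.
encoder = CameraEncoder(widths=(96, 192, 384, 768), depths=(4, 4, 8, 3))
features = encoder(torch.randn(1, 3, 224, 224))  # -> (1, 768, 7, 7)
```

Varying `depths` shifts the stage compute ratio (how many blocks each resolution stage gets), `widths` scales channel capacity, and input resolution is varied simply through the size of the tensor fed to the stem.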
Related papers
- Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection [12.245379864678291]
We tackle unsupervised medical anomaly detection by proposing a modernized autoencoder-based framework, the Q-Former Autoencoder.
We directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning.
Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning.
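The frozen-feature-extractor recipe this summary describes can be illustrated in a few lines; the backbone choice (a torchvision ResNet-50 standing in for the vision foundation models the paper actually uses) and the hook points are assumptions for the sketch.

```python
# Hedged sketch of the frozen-feature-extractor recipe: backbone weights stay
# frozen and only downstream modules would train. A torchvision ResNet-50
# stands in here for the vision foundation models the paper actually uses.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False          # no domain-specific fine-tuning

# Grab rich, multi-stage representations with forward hooks.
features = {}
for name in ("layer2", "layer3", "layer4"):
    getattr(backbone, name).register_forward_hook(
        lambda module, inputs, output, key=name: features.__setitem__(key, output)
    )

with torch.no_grad():
    backbone(torch.randn(1, 3, 224, 224))

print({k: tuple(v.shape) for k, v in features.items()})
# layer2 -> (1, 512, 28, 28), layer3 -> (1, 1024, 14, 14), layer4 -> (1, 2048, 7, 7)
```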
arXiv Detail & Related papers (2025-07-24T14:55:33Z)
- Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors [18.149244316089284]
We present Pow3R, a novel large 3D vision regression model that is highly versatile in the input modalities it accepts.
Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results.
arXiv Detail & Related papers (2025-03-21T17:12:30Z)
- Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters [32.21090169762889]
BALViT is a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders.
We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
arXiv Detail & Related papers (2025-03-05T09:30:49Z)
- Multimodal Autoregressive Pre-training of Large Vision Encoders [85.39154488397931]
We present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process.
Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification.
arXiv Detail & Related papers (2024-11-21T18:31:25Z)
- EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder [3.2595221511180306]
We develop a novel approach that transforms an image into an easy-to-classify image of its class.
We incorporate a generalized algorithmic design of the Converting Autoencoders and intraclass clustering to identify representative images.
Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and ResNet20 from 74.56% to 76.04% on CIFAR-100.
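As a rough, hypothetical illustration of the Converting Autoencoder idea, the sketch below trains a tiny autoencoder whose reconstruction target is a fixed, easy-to-classify representative image of the input's class; the paper selects such representatives via intraclass clustering, while here they are random placeholders.

```python
# Hypothetical converting-autoencoder sketch: reconstruct not the input, but
# an easy-to-classify representative of the input's class. The architecture
# and the random representatives are placeholders, not the paper's setup.
import torch
import torch.nn as nn

class ConvertingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvertingAutoencoder()
images = torch.rand(8, 3, 32, 32)            # a CIFAR-sized batch
labels = torch.randint(0, 10, (8,))
representatives = torch.rand(10, 3, 32, 32)  # one per class (placeholder)

# Reconstruction target is the class representative, not the input itself.
loss = nn.functional.mse_loss(model(images), representatives[labels])
loss.backward()
```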
arXiv Detail & Related papers (2024-04-21T20:45:18Z)
- CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity [34.025530326420146]
We develop Complementary-BEV, a novel end-to-end monocular 3D object detection framework.
We conduct extensive experiments on the public 3D detection benchmarks of roadside camera-based DAIR-V2X-I and Rope3D.
For the first time, the vehicle AP of a camera-based model reaches 80% on DAIR-V2X-I in easy mode.
arXiv Detail & Related papers (2023-10-04T13:38:53Z)
- Perspective-aware Convolution for Monocular 3D Object Detection [2.33877878310217]
We propose a novel perspective-aware convolutional layer that captures long-range dependencies in images.
By enforcing convolutional kernels to extract features along the depth axis of every image pixel, we incorporate perspective information into the network architecture.
We demonstrate improved performance on the KITTI3D dataset, achieving a 23.9% average precision in the easy benchmark.
arXiv Detail & Related papers (2023-08-24T17:25:36Z)
- Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z)
- Deep Learning Computer Vision Algorithms for Real-time UAVs On-board Camera Image Processing [77.34726150561087]
This paper describes how advanced deep learning based computer vision algorithms are applied to enable real-time on-board sensor processing for small UAVs.
All algorithms have been developed using state-of-the-art image processing methods based on deep neural networks.
arXiv Detail & Related papers (2022-11-02T11:10:42Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [13.7258515433446]
Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how transformer-based architecture achieves comparable performance while being more robust and generalizable.
arXiv Detail & Related papers (2022-02-07T13:17:29Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive because retrieval scales efficiently.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
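The scaling argument behind that trade-off can be made concrete with a toy comparison; the embedding dimension and gallery sizes below are arbitrary placeholders, and the encoders themselves are omitted.

```python
# Toy comparison of the two retrieval designs; dimensions and gallery sizes
# are arbitrary placeholders, not values from the paper.
import torch
import torch.nn.functional as F

queries = F.normalize(torch.randn(1000, 512), dim=-1)   # cached text embeddings
gallery = F.normalize(torch.randn(10000, 512), dim=-1)  # cached video embeddings

# Dual encoders: once embeddings are cached, scoring every (text, video) pair
# is one matrix multiply, which is why this design scales to large galleries.
scores = queries @ gallery.T                             # (1000, 10000)
top5 = scores.topk(5, dim=-1).indices                    # retrieval results

# A cross-attention vision-text transformer instead needs a joint forward pass
# per pair: 1000 * 10000 = 10M model evaluations to score the same gallery,
# typically more accurate but far more expensive at retrieval time.
```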
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.