RCDPT: Radar-Camera fusion Dense Prediction Transformer
- URL: http://arxiv.org/abs/2211.02432v1
- Date: Fri, 4 Nov 2022 13:16:20 GMT
- Title: RCDPT: Radar-Camera fusion Dense Prediction Transformer
- Authors: Chen-Chou Lo and Patrick Vandewalle
- Abstract summary: We propose a novel fusion strategy to integrate radar data into a vision transformer network.
Instead of using readout tokens, radar representations contribute additional depth information to a monocular depth estimation model.
The experiments are conducted on the nuScenes dataset, which includes camera images, lidar, and radar data.
- Score: 1.5899159309486681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, transformer networks have outperformed traditional deep neural
networks in natural language processing and show a large potential in many
computer vision tasks compared to convolutional backbones. In the original
transformer, readout tokens are used as designated vectors for aggregating
information from other tokens. However, the performance of using readout tokens
in a vision transformer is limited. Therefore, we propose a novel fusion
strategy to integrate radar data into a dense prediction transformer network by
reassembling camera representations with radar representations. Instead of
using readout tokens, radar representations contribute additional depth
information to a monocular depth estimation model and improve performance. We
further investigate different fusion approaches that are commonly used for
integrating an additional modality in a dense prediction transformer network. The
experiments are conducted on the nuScenes dataset, which includes camera
images, lidar, and radar data. The results show that our proposed method yields
better performance than the commonly used fusion strategies and outperforms
existing convolutional depth estimation models that fuse camera images and
radar.
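To make the fusion strategy concrete, the sketch below approximates the idea in PyTorch. It is not the authors' released code: the module name, token dimension, patch grid size, and the small radar encoder are all assumptions. It only illustrates the reassembling step described in the abstract, where camera patch tokens are reshaped into a spatial feature map and combined with radar-derived features instead of folding a readout token back into the patch tokens.

```python
# Minimal, hypothetical sketch of the fusion idea (not the authors' code):
# camera patch tokens are reassembled into a 2D feature map and concatenated
# with features from a projected radar depth map, instead of re-injecting a
# readout token into the patch tokens.
import torch
import torch.nn as nn


class RadarCameraReassemble(nn.Module):
    def __init__(self, token_dim=768, radar_dim=64, out_dim=256, grid=(24, 24)):
        super().__init__()
        self.grid = grid                      # patch grid (H/16, W/16) for a 384x384 input
        self.radar_encoder = nn.Sequential(   # encodes the sparse radar depth channel
            nn.Conv2d(1, radar_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(radar_dim, radar_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(token_dim + radar_dim, out_dim, 1)  # 1x1 fusion conv

    def forward(self, tokens, radar_depth):
        # tokens: (B, 1 + H*W, C) from a ViT; drop the readout/cls token entirely
        b, _, c = tokens.shape
        h, w = self.grid
        patch = tokens[:, 1:, :].transpose(1, 2).reshape(b, c, h, w)
        radar = self.radar_encoder(radar_depth)               # (B, radar_dim, h', w')
        radar = nn.functional.interpolate(radar, size=(h, w), mode="bilinear",
                                          align_corners=False)
        return self.fuse(torch.cat([patch, radar], dim=1))    # fused feature map


# usage with dummy shapes (384x384 image -> 24x24 patch grid of 768-d tokens)
fusion = RadarCameraReassemble()
feat = fusion(torch.randn(2, 1 + 24 * 24, 768), torch.randn(2, 1, 384, 384))
print(feat.shape)  # torch.Size([2, 256, 24, 24])
```

Under this reading, the readout (class) token is simply discarded, and the radar branch supplies the extra per-location depth evidence that the readout token would otherwise be expected to carry.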
Related papers
- Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction [7.727175654790777]
We present a multimodal transformer deep learning framework for sensing-assisted beam prediction.
We employ a convolutional neural network to extract the features from a sequence of images, point clouds, and radar raw data sampled over time.
Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%.
arXiv Detail & Related papers (2023-09-21T06:29:38Z)
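The beam-prediction pipeline summarized above follows a common multimodal pattern that can be sketched generically. The snippet below is a rough, assumed configuration (layer sizes, sequence length, and the image/GPS pairing are illustrative, not the paper's architecture): a per-modality encoder turns each time step into a token, a transformer encoder mixes the tokens, and a linear head scores candidate beams.

```python
# Generic sketch of a sensing-to-beam multimodal pipeline (assumed names and
# sizes, not the paper's exact architecture): a CNN embeds each image frame,
# an MLP embeds each GPS reading, the embeddings form a token sequence for a
# transformer encoder, and a linear head scores the candidate beams.
import torch
import torch.nn as nn


class BeamPredictor(nn.Module):
    def __init__(self, d_model=128, num_beams=64):
        super().__init__()
        self.img_cnn = nn.Sequential(          # tiny image encoder per time step
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, d_model),
        )
        self.gps_mlp = nn.Linear(2, d_model)   # (lat, lon) per time step
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, num_beams)

    def forward(self, images, gps):
        # images: (B, T, 3, H, W), gps: (B, T, 2)
        b, t = images.shape[:2]
        img_tok = self.img_cnn(images.flatten(0, 1)).view(b, t, -1)
        tokens = torch.cat([img_tok, self.gps_mlp(gps)], dim=1)   # 2T tokens
        return self.head(self.encoder(tokens).mean(dim=1))        # beam logits


logits = BeamPredictor()(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 2))
print(logits.shape)  # torch.Size([2, 64])
```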
- Effective Image Tampering Localization via Enhanced Transformer and Co-attention Fusion [5.691973573807887]
We propose an effective image tampering localization network (EITLNet) based on a two-branch enhanced transformer encoder.
The features extracted from RGB and noise streams are fused effectively by the coordinate attention-based fusion module.
arXiv Detail & Related papers (2023-09-17T15:43:06Z)
- Semantic Segmentation of Radar Detections using Convolutions on Point Clouds [59.45414406974091]
We introduce a deep-learning-based method that applies convolutions to point clouds of radar detections.
We adapt this algorithm to radar-specific properties through distance-dependent clustering and pre-processing of input point clouds.
Our network outperforms state-of-the-art approaches that are based on PointNet++ on the task of semantic segmentation of radar point clouds.
arXiv Detail & Related papers (2023-05-22T07:09:35Z)
- T-FFTRadNet: Object Detection with Swin Vision Transformers from Raw ADC Radar Signals [0.0]
Object detection utilizing Frequency Modulated Continuous Wave radar is becoming increasingly popular in the field of autonomous systems.
Radar does not suffer from the same drawbacks as other emission-based sensors such as LiDAR, primarily the degradation or loss of return signals in weather conditions such as rain or snow.
We introduce hierarchical Swin Vision transformers to the field of radar object detection and show their capability to operate on inputs varying in pre-processing, along with different radar configurations.
arXiv Detail & Related papers (2023-03-29T18:04:19Z)
- Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z)
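The hybrid CNN-transformer fusion described above can be approximated with a simple gating module. The sketch below is a loose interpretation with assumed shapes and names, not the paper's mask-guided multi-stream convolution: a learned per-pixel mask blends upsampled global transformer features with local CNN features.

```python
# Loose sketch of mask-guided fusion of a local CNN stream and a global
# transformer stream (assumed shapes/names; not the paper's exact module):
# a learned mask decides, per pixel, how much of each stream to keep.
import torch
import torch.nn as nn


class MaskGuidedFusion(nn.Module):
    def __init__(self, cnn_dim=64, vit_dim=128, out_dim=64):
        super().__init__()
        self.proj_vit = nn.Conv2d(vit_dim, out_dim, 1)       # align channel widths
        self.proj_cnn = nn.Conv2d(cnn_dim, out_dim, 1)
        self.mask = nn.Sequential(                            # per-pixel gate in [0, 1]
            nn.Conv2d(2 * out_dim, out_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_dim, 1, 1), nn.Sigmoid(),
        )

    def forward(self, cnn_feat, vit_feat):
        # cnn_feat: (B, cnn_dim, H, W); vit_feat: (B, vit_dim, h, w) on a coarser grid
        vit = nn.functional.interpolate(self.proj_vit(vit_feat),
                                        size=cnn_feat.shape[-2:], mode="bilinear",
                                        align_corners=False)
        cnn = self.proj_cnn(cnn_feat)
        m = self.mask(torch.cat([cnn, vit], dim=1))
        return m * cnn + (1.0 - m) * vit                      # blended feature map


fused = MaskGuidedFusion()(torch.randn(1, 64, 96, 96), torch.randn(1, 128, 24, 24))
print(fused.shape)  # torch.Size([1, 64, 96, 96])
```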
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z)
- Toward Data-Driven STAP Radar [23.333816677794115]
We characterize our data-driven approach to space-time adaptive processing (STAP) radar.
We generate a rich example dataset of received radar signals by randomly placing targets of variable strengths in a predetermined region.
For each data sample within this region, we generate heatmap tensors in range, azimuth, and elevation of the output power of a beamformer.
In an airborne scenario, the moving radar creates a sequence of these time-indexed image stacks, resembling a video.
arXiv Detail & Related papers (2022-01-26T02:28:13Z)
- Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction [51.80191416661064]
We propose a novel vision transformer with latent variables following an informative energy-based prior for salient object detection.
Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation.
With the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image.
arXiv Detail & Related papers (2021-12-27T06:04:33Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
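The dense prediction transformer above is the backbone that RCDPT builds on, and its central "reassemble" step can be sketched briefly. The dimensions below are illustrative assumptions: patch tokens from a ViT are reshaped back into a 2D grid, projected, and resampled so a convolutional decoder can produce a dense output.

```python
# Illustrative sketch of the "reassemble" step behind dense prediction
# transformers (dimensions are assumptions): ViT patch tokens are reshaped
# into a 2D grid and resampled so a convolutional decoder can refine them.
import torch
import torch.nn as nn


class Reassemble(nn.Module):
    def __init__(self, token_dim=768, out_dim=256, grid=(24, 24), scale=4):
        super().__init__()
        self.grid = grid
        self.project = nn.Conv2d(token_dim, out_dim, 1)           # channel projection
        self.resample = nn.Upsample(scale_factor=scale, mode="bilinear",
                                    align_corners=False)           # to decoder resolution

    def forward(self, tokens):
        # tokens: (B, 1 + H*W, C); the readout/cls token is ignored here
        b, _, c = tokens.shape
        h, w = self.grid
        grid_feat = tokens[:, 1:, :].transpose(1, 2).reshape(b, c, h, w)
        return self.resample(self.project(grid_feat))


feat = Reassemble()(torch.randn(1, 1 + 24 * 24, 768))
print(feat.shape)  # torch.Size([1, 256, 96, 96])
```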
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.