UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal
Representation in Bird's-Eye-View
- URL: http://arxiv.org/abs/2207.08536v2
- Date: Mon, 20 Mar 2023 03:12:09 GMT
- Title: UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal
Representation in Bird's-Eye-View
- Authors: Zequn Qin, Jingyu Chen, Chao Chen, Xiaozhi Chen, Xi Li
- Abstract summary: We propose a new method that unifies both spatial and temporal fusion and merges them into a unified mathematical formulation.
With the proposed unified spatial-temporal fusion, our method could support long-range fusion.
Our method achieves state-of-the-art performance on the map segmentation task.
- Score: 20.169308746548587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bird's eye view (BEV) representation is a new perception formulation for
autonomous driving that is based on spatial fusion. Temporal fusion has also been
introduced into BEV representation and has achieved great success. In this work,
we propose a new method that unifies spatial and temporal fusion and merges them
into a single mathematical formulation. The unified fusion not only provides a new
perspective on BEV fusion but also brings new capabilities. With the proposed
unified spatial-temporal fusion, our method supports long-range fusion, which is
hard to achieve in conventional BEV methods. Moreover, the BEV fusion in our work
is temporal-adaptive: the weights of temporal fusion are learnable, whereas
conventional methods mainly use fixed, equal weights for temporal fusion. In
addition, the proposed unified fusion avoids the information loss of conventional
BEV fusion methods and makes full use of features. Extensive experiments and
ablation studies on the nuScenes dataset demonstrate the effectiveness of the
proposed method, which achieves state-of-the-art performance on the map
segmentation task.
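The temporal-adaptive fusion with learnable weights mentioned in the abstract can be illustrated with a minimal sketch. The snippet below is not UniFusion's actual implementation; it only shows the general idea of softmax-normalized, learnable per-frame weights over stacked BEV features, assuming PyTorch. The class name `TemporalAdaptiveFusion` and all shapes are hypothetical choices for illustration.

```python
# Minimal, illustrative sketch of temporal-adaptive BEV fusion with learnable
# weights (NOT the paper's implementation). Assumes PyTorch; names and shapes
# are hypothetical.
import torch
import torch.nn as nn


class TemporalAdaptiveFusion(nn.Module):
    """Fuses BEV feature maps from several time steps with learnable weights.

    Conventional temporal fusion often averages frames with fixed, equal
    weights; here one weight per time step is learned and softmax-normalized,
    so the network can emphasize or suppress individual frames.
    """

    def __init__(self, num_frames: int):
        super().__init__()
        # One learnable scalar per time step (zeros give equal weighting at init).
        self.frame_logits = nn.Parameter(torch.zeros(num_frames))

    def forward(self, bev_feats: torch.Tensor) -> torch.Tensor:
        # bev_feats: (B, T, C, H, W) BEV features for T time steps.
        weights = torch.softmax(self.frame_logits, dim=0)  # (T,)
        weights = weights.view(1, -1, 1, 1, 1)             # broadcast over B, C, H, W
        return (bev_feats * weights).sum(dim=1)            # (B, C, H, W)


if __name__ == "__main__":
    fusion = TemporalAdaptiveFusion(num_frames=4)
    feats = torch.randn(2, 4, 64, 100, 100)  # batch of 2, 4 frames, 64-channel BEV grid
    fused = fusion(feats)
    print(fused.shape)  # torch.Size([2, 64, 100, 100])
```

A fixed-weight baseline would simply average the frames (weights of 1/T for all time steps); the learnable logits above let training deviate from that equal weighting.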
Related papers
- PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation [42.879223792782334]
This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance.
We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies.
Our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point-based alternatives.
arXiv Detail & Related papers (2024-12-19T13:12:15Z) - CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image with Consistency Model [20.02742423120295]
Current generative-model-based fusion methods often suffer from unstable training and slow inference.
CoMoFusion generates high-quality fused images with fast inference speed.
In order to enhance the texture and salient information of fused images, a novel loss based on pixel value selection is also designed.
arXiv Detail & Related papers (2024-05-31T12:35:06Z) - Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Cross-modality fusion of information from different modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods in mAP by 5.9% on the M3FD dataset and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z) - An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models [18.184158874126545]
We investigate how different fusion strategies can affect vision-language alignment.
A specially designed intermediate fusion can boost text-to-image alignment with improved generation quality.
Our model achieves a higher CLIP Score and lower FID, with 20% fewer FLOPs and 50% faster training.
arXiv Detail & Related papers (2024-03-25T08:16:06Z) - Equivariant Multi-Modality Image Fusion [124.11300001864579]
We propose the Equivariant Multi-Modality imAge fusion paradigm for end-to-end self-supervised learning.
Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations.
Experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images.
arXiv Detail & Related papers (2023-05-19T05:50:24Z) - DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion [144.9653045465908]
We propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM).
Our approach yields promising fusion results in infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2023-03-13T04:06:42Z) - FusionVAE: A Deep Hierarchical Variational Autoencoder for RGB Image
Fusion [16.64908104831795]
We present a novel deep hierarchical variational autoencoder called FusionVAE that can serve as a basis for many fusion tasks.
Our approach is able to generate diverse image samples that are conditioned on multiple noisy, occluded, or only partially visible input images.
arXiv Detail & Related papers (2022-09-22T19:06:55Z) - Voxel Field Fusion for 3D Object Detection [140.6941303279114]
We present a conceptually simple framework for cross-modality 3D object detection, named voxel field fusion.
The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field.
The framework is demonstrated to achieve consistent gains in various benchmarks and outperforms previous fusion-based methods on KITTI and nuScenes datasets.
arXiv Detail & Related papers (2022-05-31T16:31:36Z) - An Integrated Framework for the Heterogeneous Spatio-Spectral-Temporal
Fusion of Remote Sensing Images [22.72006711045537]
This paper first proposes a heterogeneous-integrated framework based on a novel residual cycle.
The proposed network can effectively fuse not only homogeneous but also heterogeneous information.
For the first time, a heterogeneous-integrated fusion framework is proposed to simultaneously merge the complementary heterogeneous spatial, spectral and temporal information.
arXiv Detail & Related papers (2021-09-01T14:29:23Z) - Image Fusion Transformer [75.71025138448287]
In image fusion, images obtained from different sensors are fused to generate a single image with enhanced information.
In recent years, state-of-the-art methods have adopted Convolutional Neural Networks (CNNs) to encode meaningful features for image fusion.
We propose a novel Image Fusion Transformer (IFT) where we develop a transformer-based multi-scale fusion strategy.
arXiv Detail & Related papers (2021-07-19T16:42:49Z)