Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D
Object Detection
- URL: http://arxiv.org/abs/2210.02443v1
- Date: Wed, 5 Oct 2022 17:59:51 GMT
- Title: Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D
Object Detection
- Authors: Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani,
Masayoshi Tomizuka, Wei Zhan
- Abstract summary: Current camera-only 3D detection methods use only limited history, which restricts how much temporal fusion can improve object perception.
Our framework sets a new state of the art on nuScenes, achieving first place on the test set and outperforming the previous best method by 5.2% mAP and 3.7% NDS on the validation set.
- Score: 63.809086864530784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent camera-only 3D detection methods leverage multiple timesteps,
the limited history they use significantly hampers the extent to which temporal
fusion can improve object perception. Observing that existing works' fusion of
multi-frame images is an instance of temporal stereo matching, we find that
performance is hindered by the interplay between 1) the low granularity of
matching resolution and 2) the sub-optimal multi-view setup produced by limited
history usage. Our theoretical and empirical analysis demonstrates that the
optimal temporal difference between views varies significantly for different
pixels and depths, making it necessary to fuse many timesteps over long-term
history. Building on our investigation, we propose to generate a cost volume
from a long history of image observations, compensating for the coarse but
efficient matching resolution with a more optimal multi-view matching setup.
Further, we augment the per-frame monocular depth predictions used for
long-term, coarse matching with short-term, fine-grained matching and find that
long and short term temporal fusion are highly complementary. While maintaining
high efficiency, our framework sets new state-of-the-art on nuScenes, achieving
first place on the test set and outperforming previous best art by 5.2% mAP and
3.7% NDS on the validation set. Code will be released at
https://github.com/Divadi/SOLOFusion.
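To make the long-term matching idea concrete, the following is a minimal, hypothetical sketch (not the authors' released SOLOFusion code) of how a coarse plane-sweep cost volume could be accumulated over many past frames: current-view pixels are back-projected at each candidate depth, reprojected into each historical camera, and the sampled past features are compared with the current features. The function names, tensor shapes, dot-product matching cost, and pose convention (T_past_from_cur mapping current-frame points into a past frame) are illustrative assumptions.

```python
# Hypothetical illustration only: a coarse multi-frame plane-sweep cost volume.
# Names, shapes, and the dot-product cost are assumptions, not the SOLOFusion API.
import torch
import torch.nn.functional as F


def warp_to_current(feat_past, depth_hypotheses, K, T_past_from_cur):
    """Warp one historical feature map onto the current view for each depth plane.

    feat_past:        (C, H, W) features from a past frame
    depth_hypotheses: (D,) candidate depths for the plane sweep
    K:                (3, 3) camera intrinsics
    T_past_from_cur:  (4, 4) pose taking current-frame points into the past frame
    Returns (D, C, H, W) warped features, one per hypothesised depth.
    """
    C, H, W = feat_past.shape
    device = feat_past.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # (3, H*W)
    rays = torch.inverse(K) @ pix            # back-projected rays in the current camera
    R, t = T_past_from_cur[:3, :3], T_past_from_cur[:3, 3:]

    warped = []
    for d in depth_hypotheses:
        pts_cur = rays * d                   # 3D points at hypothesised depth d
        pts_past = R @ pts_cur + t           # transform into the past camera
        proj = K @ pts_past
        uv = proj[:2] / proj[2:].clamp(min=1e-6)
        # Normalise pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack(
            [2.0 * uv[0] / (W - 1) - 1.0, 2.0 * uv[1] / (H - 1) - 1.0], dim=-1
        ).reshape(1, H, W, 2)
        warped.append(
            F.grid_sample(feat_past[None], grid, align_corners=True,
                          padding_mode="zeros")[0]
        )
    return torch.stack(warped, dim=0)


def long_term_cost_volume(feat_cur, history, depth_hypotheses, K):
    """Accumulate matching costs against a long window of past frames.

    feat_cur: (C, H, W) current-frame features
    history:  list of (feat_past, T_past_from_cur) pairs spanning the history window
    Returns a (D, H, W) cost volume; higher values mean better matches.
    """
    cost = feat_cur.new_zeros(len(depth_hypotheses), *feat_cur.shape[1:])
    for feat_past, T_past_from_cur in history:
        warped = warp_to_current(feat_past, depth_hypotheses, K, T_past_from_cur)
        cost = cost + (feat_cur[None] * warped).sum(dim=1)  # dot-product similarity
    return cost / max(len(history), 1)
```

Because the history is long, the sweep in this sketch would typically be kept deliberately coarse (few depth hypotheses, low feature resolution), mirroring the abstract's trade-off of coarse but efficient matching compensated by a more optimal multi-view setup.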
Related papers
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must contend with high across-frame variation in object appearance and the diverse deterioration that occurs in some frames.
Most contemporary aggregation methods are tailored to two-stage detectors and suffer from high computational costs.
This study introduces a simple yet potent feature selection and aggregation strategy that gains significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z) - M$^2$Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation [22.018059988585403]
M$^2$Depth is designed to predict reliable scale-aware surrounding depth in autonomous driving.
We first construct cost volumes in spatial and temporal domains individually.
We propose a spatial-temporal fusion module that integrates the spatial-temporal information to yield a strong volume representation.
arXiv Detail & Related papers (2024-05-03T11:06:37Z) - PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection [66.94819989912823]
We propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection.
We use point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement.
We conduct extensive experiments on a large-scale dataset to demonstrate that our approach performs well against state-of-the-art methods.
arXiv Detail & Related papers (2023-12-13T18:59:13Z) - Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping [12.442574943138794]
The paper explores the industrial multimodal Anomaly Detection (AD) task, which exploits point clouds and RGB images to localize anomalies.
We introduce a novel, lightweight, and fast framework that learns to map features from one modality to the other on nominal samples.
arXiv Detail & Related papers (2023-12-07T18:41:21Z) - Searching a Compact Architecture for Robust Multi-Exposure Image Fusion [55.37210629454589]
Two major stumbling blocks hinder development: pixel misalignment and inefficient inference.
This study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion.
The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios.
arXiv Detail & Related papers (2023-05-20T17:01:52Z) - Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception [27.598461348452343]
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View 3D perception.
Existing methods mostly fuse temporal information in a parallel manner.
We name this simple but effective fusing pipeline VideoBEV.
arXiv Detail & Related papers (2023-03-10T15:01:51Z) - DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
arXiv Detail & Related papers (2022-12-15T14:18:47Z) - STS: Surround-view Temporal Stereo for Multi-view 3D Detection [28.137180365082976]
We propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometry correspondence between frames across time to facilitate accurate depth learning.
Experiments on nuScenes show that STS greatly boosts 3D detection ability, notably for medium and long distance objects.
arXiv Detail & Related papers (2022-08-22T08:46:33Z) - Exploring Data Augmentation for Multi-Modality 3D Object Detection [82.9988604088494]
It is counter-intuitive that multi-modality methods based on point clouds and images perform only marginally better, or sometimes worse, than approaches that use point clouds alone.
We propose a pipeline, named transformation flow, to bridge the gap between single and multi-modality data augmentation with transformation reversing and replaying.
Our method also wins the best PKL award in the 3rd nuScenes detection challenge.
arXiv Detail & Related papers (2020-12-23T15:23:16Z) - Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.