FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level
Gradient Calibration
- URL: http://arxiv.org/abs/2307.16617v1
- Date: Mon, 31 Jul 2023 12:50:15 GMT
- Title: FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level
Gradient Calibration
- Authors: Zhijian Huang, Sihao Lin, Guiyu Liu, Mukun Luo, Chaoqiang Ye, Hang Xu,
Xiaojun Chang, Xiaodan Liang
- Abstract summary: Multi-modality fusion and multi-task learning are becoming trendy in 3D autonomous driving scenarios.
Previous works coordinate the learning framework manually with empirical knowledge, which may lead to sub-optimal solutions.
We propose a novel yet simple multi-level gradient calibration learning framework across tasks and modalities during optimization.
- Score: 89.4165092674947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modality fusion and multi-task learning are becoming trendy in 3D
autonomous driving scenarios, given the demands of robust prediction and a
limited computation budget. However, naively extending an existing framework to
multi-modality multi-task learning remains ineffective and even detrimental due
to the notorious modality bias and task conflict. Previous works coordinate the
learning framework manually with empirical knowledge, which may lead to
sub-optimal solutions. To mitigate the issue, we propose a novel yet simple
multi-level gradient calibration learning framework across tasks and modalities
during optimization. Specifically, the gradients produced by the task heads and
used to update the shared backbone are first calibrated at the backbone's last
layer to alleviate task conflict. Before the calibrated gradients are further
propagated to the modality branches of the backbone, their magnitudes are
calibrated again to the same level, ensuring that the downstream tasks pay
balanced attention to different modalities. Experiments on the large-scale
nuScenes benchmark demonstrate the effectiveness of the proposed method, e.g.,
an absolute 14.4% mIoU improvement on map segmentation and a 1.4% mAP
improvement on 3D detection, advancing multi-modality fusion and multi-task
learning for 3D autonomous driving. We also discuss the links between
modalities and tasks.
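To make the two-level calibration concrete, here is a minimal, hypothetical sketch in PyTorch. The abstract does not specify the exact calibration rules, so the task-level step below uses a PCGrad-style conflict projection as a stand-in, and the function names and the mean-norm rule for modality magnitudes are illustrative assumptions, not the paper's implementation.

```python
import torch

def calibrate_task_gradients(task_grads):
    """Level 1 (sketch): reduce task conflict at the backbone's last layer.
    Stand-in rule: project each task gradient off any negatively aligned
    peer gradient (PCGrad-style), then sum the calibrated gradients."""
    calibrated = []
    for i, g in enumerate(task_grads):
        g = g.clone()
        for j, h in enumerate(task_grads):
            if i == j:
                continue
            dot = torch.dot(g.flatten(), h.flatten())
            if dot < 0:  # negative alignment signals task conflict
                g = g - (dot / h.flatten().pow(2).sum()) * h
        calibrated.append(g)
    return torch.stack(calibrated).sum(dim=0)

def calibrate_modality_magnitudes(modality_grads):
    """Level 2 (sketch): rescale the gradients entering each modality
    branch (e.g., LiDAR and camera) to a common magnitude, so the
    downstream tasks pay balanced attention to the modalities."""
    norms = [g.norm() for g in modality_grads]
    target = torch.stack(norms).mean()  # assumed common level: the mean norm
    return [g * (target / (n + 1e-12)) for g, n in zip(modality_grads, norms)]
```

In the paper's setting the two steps would run in sequence during the backward pass: the task-head gradients are calibrated once at the backbone's last shared layer, and the result is magnitude-calibrated again where it splits into the modality branches.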
Related papers
- Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations? [55.99654128127689]
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations.
Existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process.
We propose a new framework, namely CMCR, to address these shortcomings.
arXiv Detail & Related papers (2024-12-12T06:09:49Z)
- Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment [0.0]
"Harmonized Transfer Learning and Modality alignment (HarMA)" is a method that simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment.
HarMA achieves state-of-the-art performance in two popular multimodal retrieval tasks in the field of remote sensing.
arXiv Detail & Related papers (2024-04-28T17:20:08Z)
- Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach [38.76462300149459]
We develop a Multi-objective Correction (MoCo) method for multi-objective gradient optimization.
The unique feature of our method is that it provably converges while correcting the bias of stochastic multi-objective gradients.
arXiv Detail & Related papers (2022-10-23T05:54:26Z)
- CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Conflict-Averse Gradient Descent for Multi-task Learning [56.379937772617]
A major challenge in optimizing a multi-task model is conflicting gradients across tasks.
We introduce Conflict-Averse Gradient descent (CAGrad), which minimizes the average loss function while explicitly avoiding conflict among per-task gradients (a sketch of inspecting such conflict follows this list).
CAGrad balances the objectives automatically and still provably converges to a minimum of the average loss.
arXiv Detail & Related papers (2021-10-26T22:03:51Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
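FULLER, CAGrad, and MoCo above all operate on per-task gradients of shared parameters. Here is the sketch referenced in the CAGrad entry: a minimal PyTorch example of how such gradients can be obtained and their conflict inspected. The modules and losses are toy stand-ins, not any paper's model.

```python
import torch

# Toy shared backbone and two task heads (illustrative stand-ins).
shared = torch.nn.Linear(8, 8)
head_a = torch.nn.Linear(8, 1)
head_b = torch.nn.Linear(8, 1)

x = torch.randn(4, 8)
feats = shared(x)
loss_a = head_a(feats).pow(2).mean()  # toy loss for task A
loss_b = head_b(feats).abs().mean()   # toy loss for task B

# Per-task gradients w.r.t. the shared parameters; retain_graph keeps the
# autograd graph alive so the second task can also backpropagate through it.
params = list(shared.parameters())
g_a = torch.autograd.grad(loss_a, params, retain_graph=True)
g_b = torch.autograd.grad(loss_b, params)

# Cosine similarity of the flattened gradients: a negative value is the
# gradient conflict that CAGrad-style calibration tries to mitigate.
fa = torch.cat([g.flatten() for g in g_a])
fb = torch.cat([g.flatten() for g in g_b])
print(torch.nn.functional.cosine_similarity(fa, fb, dim=0).item())
```

Methods like CAGrad then choose an update direction from these per-task gradients instead of simply averaging them.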