FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level
Gradient Calibration
- URL: http://arxiv.org/abs/2307.16617v1
- Date: Mon, 31 Jul 2023 12:50:15 GMT
- Title: FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level
Gradient Calibration
- Authors: Zhijian Huang, Sihao Lin, Guiyu Liu, Mukun Luo, Chaoqiang Ye, Hang Xu,
Xiaojun Chang, Xiaodan Liang
- Abstract summary: Multi-modality fusion and multi-task learning are becoming trendy in 3D autonomous driving scenarios.
Previous works manually coordinate the learning framework with empirical knowledge, which may lead to sub-optimal results.
We propose a novel yet simple multi-level gradient calibration learning framework that operates across tasks and modalities during optimization.
- Score: 89.4165092674947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modality fusion and multi-task learning are becoming trendy in 3D
autonomous driving scenarios, given the demands of robust prediction and a limited
computation budget. However, naively extending the existing framework to the domain of
multi-modality multi-task learning remains ineffective and even harmful due
to the notorious modality bias and task conflict. Previous works manually
coordinate the learning framework with empirical knowledge, which may lead to
sub-optimal results. To mitigate the issue, we propose a novel yet simple multi-level
gradient calibration learning framework across tasks and modalities during
optimization. Specifically, the gradients, produced by the task heads and used
to update the shared backbone, will be calibrated at the backbone's last layer
to alleviate the task conflict. Before the calibrated gradients are further
propagated to the modality branches of the backbone, their magnitudes will be
calibrated again to the same level, ensuring the downstream tasks pay balanced
attention to different modalities. Experiments on the large-scale nuScenes
benchmark demonstrate the effectiveness of the proposed method, e.g., an absolute
14.4% mIoU improvement on map segmentation and a 1.4% mAP improvement on 3D
detection, advancing the application of multi-modality fusion and multi-task
learning in 3D autonomous driving. We also discuss the links between modalities
and tasks.
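The two-level scheme described in the abstract can be pictured with a toy PyTorch sketch. Everything below is illustrative rather than the paper's implementation: the module and loss names are hypothetical, a PCGrad-style projection stands in for the task-level calibration rule (which the abstract does not specify), and the modality-level step simply rescales each branch's gradient norm to the mean across branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBackbone(nn.Module):
    """Two modality branches feeding one shared fusion layer and two task heads."""
    def __init__(self, dim=16):
        super().__init__()
        self.lidar_branch = nn.Linear(dim, dim)    # modality branch 1 (hypothetical)
        self.camera_branch = nn.Linear(dim, dim)   # modality branch 2 (hypothetical)
        self.fusion = nn.Linear(2 * dim, dim)      # backbone's last shared layer
        self.det_head = nn.Linear(dim, 1)          # toy 3D detection head
        self.seg_head = nn.Linear(dim, 1)          # toy map segmentation head

    def forward(self, x_lidar, x_camera):
        z = torch.cat([self.lidar_branch(x_lidar),
                       self.camera_branch(x_camera)], dim=-1)
        z = self.fusion(z)
        return self.det_head(z), self.seg_head(z)

def flat_grad(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.flatten() for g in grads])

def calibrated_step(model, opt, x_lidar, x_camera, y_det, y_seg):
    det, seg = model(x_lidar, x_camera)
    losses = [F.mse_loss(det, y_det), F.mse_loss(seg, y_seg)]

    # Level 1: task-level calibration at the backbone's last shared layer.
    # A PCGrad-style projection is used here as a stand-in rule.
    fusion_params = list(model.fusion.parameters())
    g = [flat_grad(l, fusion_params) for l in losses]
    if torch.dot(g[0], g[1]) < 0:  # conflicting task gradients
        g[0] = g[0] - torch.dot(g[0], g[1]) / (g[1].norm() ** 2) * g[1]

    opt.zero_grad()
    sum(losses).backward()

    # Overwrite the fusion layer's gradient with the calibrated one.
    merged, offset = g[0] + g[1], 0
    for p in fusion_params:
        p.grad = merged[offset:offset + p.numel()].view_as(p)
        offset += p.numel()

    # Level 2: rescale per-modality gradient magnitudes to a common level so
    # that the downstream tasks pay balanced attention to both modalities.
    branches = [model.lidar_branch, model.camera_branch]
    norms = [torch.cat([p.grad.flatten() for p in b.parameters()]).norm()
             for b in branches]
    target = sum(norms) / len(norms)
    for b, n in zip(branches, norms):
        for p in b.parameters():
            p.grad.mul_(target / (n + 1e-8))
    opt.step()

# Toy usage:
# model = ToyBackbone(); opt = torch.optim.SGD(model.parameters(), lr=0.01)
# x = torch.randn(8, 16)
# calibrated_step(model, opt, x, torch.randn(8, 16), torch.randn(8, 1), torch.randn(8, 1))
```

Note that in the paper the calibrated gradient is the one further propagated into the modality branches; this sketch approximates that by rescaling the branch gradients after a standard backward pass.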
Related papers
- Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones.
MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z)
- Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment [0.0]
"Harmonized Transfer Learning and Modality alignment (HarMA)" is a method that simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment.
HarMA achieves state-of-the-art performance in two popular multimodal retrieval tasks in the field of remote sensing.
arXiv Detail & Related papers (2024-04-28T17:20:08Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
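A minimal sketch of this alternating scheme, assuming two modalities, per-modality optimizers, and a classification loss (all names are hypothetical, not MLA's actual code):

```python
import torch
import torch.nn as nn

encoders = {"rgb": nn.Linear(16, 8), "depth": nn.Linear(16, 8)}
shared_head = nn.Linear(8, 3)  # captures cross-modal interactions
opts = {m: torch.optim.SGD(list(e.parameters()) + list(shared_head.parameters()), lr=0.1)
        for m, e in encoders.items()}
loss_fn = nn.CrossEntropyLoss()

def train_step(batches):
    # Alternate over modalities: one unimodal forward/backward per modality,
    # so no single modality dominates a joint update, while the shared head
    # is continuously optimized across all of them.
    for m, (x, y) in batches.items():
        opts[m].zero_grad()
        loss = loss_fn(shared_head(encoders[m](x)), y)
        loss.backward()
        opts[m].step()

# Toy usage:
# batches = {"rgb": (torch.randn(4, 16), torch.randint(0, 3, (4,))),
#            "depth": (torch.randn(4, 16), torch.randint(0, 3, (4,)))}
# train_step(batches)
```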
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach [38.76462300149459]
We develop a Multi-objective Correction (MoCo) method for multi-objective gradient optimization.
The unique feature of our method is that it can guarantee convergence without increasing the batch size, even in the non-convex setting.
arXiv Detail & Related papers (2022-10-23T05:54:26Z)
- CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Conflict-Averse Gradient Descent for Multi-task Learning [56.379937772617]
A major challenge in optimizing a multi-task model is the conflicting gradients.
We introduce Conflict-Averse Gradient descent (CAGrad), which minimizes the average loss while using the worst local improvement among individual tasks to regularize the update direction.
CAGrad balances the objectives automatically and still provably converges to a minimum over the average loss.
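For two tasks, the conflict-averse direction can be sketched by solving CAGrad's dual with a coarse grid search over the simplex; the grid search is a stand-in for the optimizer used in practice, and `c` is the trust-region hyperparameter:

```python
import torch

def cagrad_direction(g1, g2, c=0.5):
    """Return a conflict-averse update direction for two flattened task gradients."""
    g0 = 0.5 * (g1 + g2)                       # average gradient
    g0_norm = g0.norm()
    best_w, best_val = 0.5, float("inf")
    # Minimize the dual objective  gw.g0 + c*||g0||*||gw||  over w in [0, 1].
    for w in torch.linspace(0, 1, 101):
        gw = w * g1 + (1 - w) * g2
        val = (torch.dot(gw, g0) + c * g0_norm * gw.norm()).item()
        if val < best_val:
            best_w, best_val = w.item(), val
    gw = best_w * g1 + (1 - best_w) * g2
    lam = c * g0_norm / (gw.norm() + 1e-8)
    return g0 + lam * gw                       # applied in place of the average gradient

# Toy usage: d = cagrad_direction(torch.randn(10), torch.randn(10))
```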
arXiv Detail & Related papers (2021-10-26T22:03:51Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-sampling-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)