FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level
Gradient Calibration
- URL: http://arxiv.org/abs/2307.16617v1
- Date: Mon, 31 Jul 2023 12:50:15 GMT
- Title: FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level
Gradient Calibration
- Authors: Zhijian Huang, Sihao Lin, Guiyu Liu, Mukun Luo, Chaoqiang Ye, Hang Xu,
Xiaojun Chang, Xiaodan Liang
- Abstract summary: Multi-modality fusion and multi-task learning are becoming trendy in 3D autonomous driving scenarios.
Previous works coordinate the learning framework manually with empirical knowledge, which may lead to sub-optimal solutions.
We propose a novel yet simple multi-level gradient calibration learning framework across tasks and modalities during optimization.
- Score: 89.4165092674947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modality fusion and multi-task learning are becoming trendy in 3D
autonomous driving scenarios, given the demands of robust prediction and a
limited computation budget. However, naively extending an existing framework to
multi-modality multi-task learning remains ineffective and even detrimental due
to the notorious modality bias and task conflict. Previous works coordinate the
learning framework manually with empirical knowledge, which may lead to
sub-optimal solutions. To mitigate the issue, we propose a novel yet simple
multi-level gradient calibration learning framework across tasks and modalities
during optimization. Specifically, the gradients produced by the task heads and
used to update the shared backbone are first calibrated at the backbone's last
layer to alleviate task conflict. Before the calibrated gradients are further
propagated to the modality branches of the backbone, their magnitudes are
calibrated again to the same level, ensuring that the downstream tasks pay
balanced attention to different modalities. Experiments on the large-scale
nuScenes benchmark demonstrate the effectiveness of the proposed method, e.g.,
an absolute 14.4% mIoU improvement on map segmentation and a 1.4% mAP
improvement on 3D detection, advancing multi-modality fusion and multi-task
learning for 3D autonomous driving. We also discuss the links between
modalities and tasks.
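To make the two-level calibration concrete, here is a minimal, hypothetical sketch in PyTorch. The abstract does not specify the exact calibration rules, so the task-level step below uses a PCGrad-style conflict projection as a stand-in, and the function names and the mean-norm rule for modality magnitudes are illustrative assumptions, not the paper's implementation.

```python
import torch

def calibrate_task_gradients(task_grads):
    """Level 1 (sketch): reduce task conflict at the backbone's last layer.
    Stand-in rule: project each task gradient off any negatively aligned
    peer gradient (PCGrad-style), then sum the calibrated gradients."""
    calibrated = []
    for i, g in enumerate(task_grads):
        g = g.clone()
        for j, h in enumerate(task_grads):
            if i == j:
                continue
            dot = torch.dot(g.flatten(), h.flatten())
            if dot < 0:  # negative alignment signals task conflict
                g = g - (dot / h.flatten().pow(2).sum()) * h
        calibrated.append(g)
    return torch.stack(calibrated).sum(dim=0)

def calibrate_modality_magnitudes(modality_grads):
    """Level 2 (sketch): rescale the gradients entering each modality
    branch (e.g., LiDAR and camera) to a common magnitude, so the
    downstream tasks pay balanced attention to the modalities."""
    norms = [g.norm() for g in modality_grads]
    target = torch.stack(norms).mean()  # assumed common level: the mean norm
    return [g * (target / (n + 1e-12)) for g, n in zip(modality_grads, norms)]
```

In the paper's setting the two steps would run in sequence during the backward pass: the task-head gradients are calibrated once at the backbone's last shared layer, and the result is magnitude-calibrated again where it splits into the modality branches.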
Related papers
- Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations? [55.99654128127689]
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations.
Existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process.
We propose a new framework, namely CMCR, to address these shortcomings.
arXiv Detail & Related papers (2024-12-12T06:09:49Z)
- Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment [0.0]
"Harmonized Transfer Learning and Modality alignment (HarMA)" is a method that simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment.
HarMA achieves state-of-the-art performance in two popular multimodal retrieval tasks in the field of remote sensing.
arXiv Detail & Related papers (2024-04-28T17:20:08Z)
- Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach [38.76462300149459]
We develop a Multi-objective Correction (MoCo) method for multi-objective gradient optimization.
The unique feature of our method is that it provably converges while correcting the bias of stochastic multi-objective gradients.
arXiv Detail & Related papers (2022-10-23T05:54:26Z)
- CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Conflict-Averse Gradient Descent for Multi-task Learning [56.379937772617]
A major challenge in optimizing a multi-task model is conflicting gradients across tasks.
We introduce Conflict-Averse Gradient descent (CAGrad), which minimizes the average loss function while explicitly avoiding conflict among per-task gradients (a sketch of inspecting such conflict follows this list).
CAGrad balances the objectives automatically and still provably converges to a minimum of the average loss.
arXiv Detail & Related papers (2021-10-26T22:03:51Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
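FULLER, CAGrad, and MoCo above all operate on per-task gradients of shared parameters. Here is the sketch referenced in the CAGrad entry: a minimal PyTorch example of how such gradients can be obtained and their conflict inspected. The modules and losses are toy stand-ins, not any paper's model.

```python
import torch

# Toy shared backbone and two task heads (illustrative stand-ins).
shared = torch.nn.Linear(8, 8)
head_a = torch.nn.Linear(8, 1)
head_b = torch.nn.Linear(8, 1)

x = torch.randn(4, 8)
feats = shared(x)
loss_a = head_a(feats).pow(2).mean()  # toy loss for task A
loss_b = head_b(feats).abs().mean()   # toy loss for task B

# Per-task gradients w.r.t. the shared parameters; retain_graph keeps the
# autograd graph alive so the second task can also backpropagate through it.
params = list(shared.parameters())
g_a = torch.autograd.grad(loss_a, params, retain_graph=True)
g_b = torch.autograd.grad(loss_b, params)

# Cosine similarity of the flattened gradients: a negative value is the
# gradient conflict that CAGrad-style calibration tries to mitigate.
fa = torch.cat([g.flatten() for g in g_a])
fb = torch.cat([g.flatten() for g in g_b])
print(torch.nn.functional.cosine_similarity(fa, fb, dim=0).item())
```

Methods like CAGrad then choose an update direction from these per-task gradients instead of simply averaging them.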