Multi-branch Collaborative Learning Network for 3D Visual Grounding
- URL: http://arxiv.org/abs/2407.05363v2
- Date: Wed, 10 Jul 2024 11:31:50 GMT
- Title: Multi-branch Collaborative Learning Network for 3D Visual Grounding
- Authors: Zhipeng Qian, Yiwei Ma, Zhekai Lin, Jiayi Ji, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
- Abstract summary: 3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration.
We argue that employing separate branches for 3DREC and 3DRES tasks enhances the model's capacity to learn specific information for each task.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration. However, existing collaborative approaches predominantly depend on the results of one task to make predictions for the other, limiting effective collaboration. We argue that employing separate branches for 3DREC and 3DRES tasks enhances the model's capacity to learn specific information for each task, enabling them to acquire complementary knowledge. Thus, we propose the MCLN framework, which includes independent branches for 3DREC and 3DRES tasks. This enables dedicated exploration of each task and effective coordination between the branches. Furthermore, to facilitate mutual reinforcement between these branches, we introduce a Relative Superpoint Aggregation (RSA) module and an Adaptive Soft Alignment (ASA) module. These modules significantly contribute to the precise alignment of prediction results from the two branches, directing the model to allocate increased attention to key positions. Comprehensive experimental evaluation demonstrates that our proposed method achieves state-of-the-art performance on both the 3DREC and 3DRES tasks, with an increase of 2.05% in Acc@0.5 for 3DREC and 3.96% in mIoU for 3DRES.
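The abstract describes MCLN only at a high level. The minimal PyTorch sketch below illustrates the general two-branch idea: separate heads score proposal boxes (3DREC) and superpoints (3DRES) over shared features, and a soft alignment term ties the two branches' predictions together. All names, dimensions, and the KL-based alignment formulation are illustrative assumptions, not the paper's actual RSA/ASA implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchGrounder(nn.Module):
    """Toy two-branch head: one branch scores proposal boxes (3DREC),
    the other scores superpoints (3DRES), over shared features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.rec_head = nn.Linear(feat_dim, 1)  # 3DREC: one score per proposal
        self.res_head = nn.Linear(feat_dim, 1)  # 3DRES: one score per superpoint

    def forward(self, proposal_feats, superpoint_feats):
        # proposal_feats: (B, P, D); superpoint_feats: (B, S, D)
        rec_logits = self.rec_head(proposal_feats).squeeze(-1)    # (B, P)
        res_logits = self.res_head(superpoint_feats).squeeze(-1)  # (B, S)
        return rec_logits, res_logits

def soft_alignment_loss(rec_logits, res_logits, sp_in_box):
    """Stand-in for the paper's soft-alignment idea: pool the segmentation
    scores of superpoints lying inside each proposal box and pull the pooled
    distribution toward the box-scoring distribution with a KL term.

    sp_in_box: (B, P, S) binary mask, 1 where superpoint s is inside box p.
    """
    res_prob = res_logits.sigmoid()                           # (B, S)
    pooled = torch.einsum('bps,bs->bp', sp_in_box, res_prob)  # sum inside each box
    pooled = pooled / sp_in_box.sum(-1).clamp(min=1.0)        # mean score per box
    return F.kl_div(F.log_softmax(pooled, dim=-1),
                    F.softmax(rec_logits, dim=-1),
                    reduction='batchmean')

if __name__ == "__main__":
    B, P, S, D = 2, 8, 64, 256
    model = TwoBranchGrounder(feat_dim=D)
    rec, res = model(torch.randn(B, P, D), torch.randn(B, S, D))
    loss = soft_alignment_loss(rec, res, (torch.rand(B, P, S) > 0.5).float())
    loss.backward()  # gradients flow into both branches through the KL term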
Related papers
- Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones.
MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z) - Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization [51.33923845954759]
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications.
We propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks.
In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection.
arXiv Detail & Related papers (2024-04-17T04:46:27Z) - Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception [18.358998861454477]
Multi-agent collaborative perception, as a potential application of vehicle-to-everything communication, could significantly improve the perception performance of autonomous vehicles over single-agent perception.
We propose SCOPE, a novel collaborative perception framework that aggregates awareness characteristics across agents in an end-to-end manner.
arXiv Detail & Related papers (2023-07-26T03:00:31Z) - CORE: Cooperative Reconstruction for Multi-Agent Perception [24.306731432524227]
CORE is a conceptually simple, effective and communication-efficient model for multi-agent cooperative perception.
It addresses the task from a novel perspective of cooperative reconstruction, based on two key insights.
We validate CORE on OPV2V, a large-scale multi-agent perception dataset.
arXiv Detail & Related papers (2023-07-21T11:50:05Z) - A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z) - LIGHT: Joint Individual Building Extraction and Height Estimation from Satellite Images through a Unified Multitask Learning Network [8.09909901104654]
Building extraction and height estimation are two fundamental tasks in remote sensing image interpretation.
Most existing research treats the two tasks as independent studies.
In this work, we combine the individuaL buIlding extraction and heiGHt estimation through a unified multiTask learning network.
arXiv Detail & Related papers (2023-04-03T15:48:24Z) - Joint 2D-3D Multi-Task Learning on Cityscapes-3D: 3D Detection, Segmentation, and Depth Estimation [11.608682595506354]
TaskPrompter presents an innovative multi-task prompting framework.
It unifies the learning of (i) task-generic representations, (ii) task-specific representations, and (iii) cross-task interactions.
The new benchmark requires the multi-task model to concurrently generate predictions for monocular 3D vehicle detection, semantic segmentation, and monocular depth estimation.
arXiv Detail & Related papers (2023-04-03T13:41:35Z) - Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation [87.1188556802942]
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting.
We propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions.
Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain.
arXiv Detail & Related papers (2021-05-17T13:42:09Z) - DCR-Net: A Deep Co-Interactive Relation Network for Joint Dialog Act Recognition and Sentiment Classification [77.59549450705384]
In dialog systems, dialog act recognition and sentiment classification are two correlated tasks.
Most existing systems either treat them as separate tasks or simply model the two jointly.
We propose a Deep Co-Interactive Relation Network (DCR-Net) to explicitly consider the cross-impact and model the interaction between the two tasks.
arXiv Detail & Related papers (2020-08-16T14:13:32Z) - An Iterative Multi-Knowledge Transfer Network for Aspect-Based Sentiment Analysis [73.7488524683061]
We propose a novel Iterative Multi-Knowledge Transfer Network (IMKTN) for end-to-end ABSA.
Our IMKTN transfers task-specific knowledge from any two of the three subtasks to the third at the token level, using a well-designed routing algorithm.
Experimental results on three benchmark datasets demonstrate the effectiveness and superiority of our approach.
arXiv Detail & Related papers (2020-04-04T13:49:54Z)