A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to
1st VCL Challenge -- Multi-Task Robustness Track
- URL: http://arxiv.org/abs/2402.17319v1
- Date: Tue, 27 Feb 2024 08:51:20 GMT
- Title: A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to
1st VCL Challenge -- Multi-Task Robustness Track
- Authors: Zehui Chen, Qiuchen Wang, Zhenyu Li, Jiaming Liu, Shanghang Zhang,
Feng Zhao
- Abstract summary: We propose a framework named UniNet that seamlessly combines various visual perception algorithms into a multi-task model.
Specifically, we choose DETR3D, Mask2Former, and BinsFormer for 3D object detection, instance segmentation, and depth estimation tasks.
The final submission is a single model with InternImage-L backbone, and achieves a 49.6 overall score.
- Score: 31.754017006309564
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this report, we present our solution to the multi-task robustness track of
the 1st Visual Continual Learning (VCL) Challenge at ICCV 2023 Workshop. We
propose a vanilla framework named UniNet that seamlessly combines various
visual perception algorithms into a multi-task model. Specifically, we choose
DETR3D, Mask2Former, and BinsFormer for 3D object detection, instance
segmentation, and depth estimation tasks, respectively. The final submission is
a single model with InternImage-L backbone, and achieves a 49.6 overall score
(29.5 Det mAP, 80.3 mTPS, 46.4 Seg mAP, and 7.93 silog) on SHIFT validation
set. In addition, we share several observations from our experiments that may
facilitate the development of multi-task learning in dense visual prediction.
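The design the abstract describes, a single shared backbone feeding several task-specific heads, can be pictured with a minimal PyTorch sketch. This is an illustration only: `UniNetSketch` and the toy backbone and heads below are hypothetical stand-ins for InternImage-L, DETR3D, Mask2Former, and BinsFormer, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of the shared-backbone,
# per-task-head pattern described in the abstract.
import torch
import torch.nn as nn

class UniNetSketch(nn.Module):
    def __init__(self, backbone: nn.Module, heads: nn.ModuleDict):
        super().__init__()
        self.backbone = backbone  # shared feature extractor (InternImage-L in the paper)
        self.heads = heads        # one head per task

    def forward(self, images: torch.Tensor) -> dict:
        feats = self.backbone(images)  # features computed once, shared by all tasks
        return {task: head(feats) for task, head in self.heads.items()}

# Toy instantiation with placeholder modules; the real heads consume
# multi-scale features and produce structured task-specific outputs.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
heads = nn.ModuleDict({
    "det3d": nn.Conv2d(64, 10, 1),  # stand-in for DETR3D (3D object detection)
    "seg":   nn.Conv2d(64, 21, 1),  # stand-in for Mask2Former (instance segmentation)
    "depth": nn.Conv2d(64, 1, 1),   # stand-in for BinsFormer (depth estimation)
})
model = UniNetSketch(backbone, heads)
outputs = model(torch.randn(2, 3, 224, 224))  # dict of per-task predictions
```

The sketch only shows the shared-feature routing; in the actual submission each head would consume multi-scale backbone features and carry its own task loss.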
Related papers
- First Place Solution to the ECCV 2024 ROAD++ Challenge @ ROAD++ Atomic Activity Recognition 2024 [5.674251666234644]
This report presents our team's technical solution for participating in Track 3 of the 2024 ECCV ROAD++ Challenge.
The task of Track 3 is atomic activity recognition, which aims to identify 64 types of atomic activities in road scenes based on video content.
Our approach primarily addresses the challenges of small objects, of distinguishing between a single object and a group of objects, and of model overfitting.
arXiv Detail & Related papers (2024-10-30T15:06:58Z)
- A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images.
We validated our approach on a large hyperspectral dataset called TAIGA.
A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter task discrepancy, since pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method achieves significant improvements on the MOT17 and MOT20 datasets while reaching state-of-the-art performance on the DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models deliver performance on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- 3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW [68.56017675820897]
In this paper, we introduce the 3rd place solution for the PVUW2023 VSS track.
We have explored various image-level visual backbones and segmentation heads to tackle the problem of video semantic segmentation.
arXiv Detail & Related papers (2023-06-04T07:50:38Z)
- Simultaneous Multiple Object Detection and Pose Estimation using 3D Model Infusion with Monocular Vision [21.710141497071373]
Multiple object detection and pose estimation are vital computer vision tasks.
We propose simultaneous neural modeling of both using monocular vision and 3D model infusion.
Our Simultaneous Multiple Object detection and Pose Estimation network (SMOPE-Net) is an end-to-end trainable multitasking network.
arXiv Detail & Related papers (2022-11-21T05:18:56Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.