Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
- URL: http://arxiv.org/abs/2309.04422v2
- Date: Sun, 26 Nov 2023 15:25:11 GMT
- Title: Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
- Authors: Thomas E. Huang, Yifan Liu, Luc Van Gool, Fisher Yu
- Abstract summary: Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
- Score: 85.62076860189116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performing multiple heterogeneous visual tasks in dynamic scenes is a
hallmark of human perception capability. Despite remarkable progress in image
and video recognition via representation learning, current research still
focuses on designing specialized networks for single tasks, homogeneous tasks,
or simple combinations of tasks. We instead explore the construction of a unified model
for major image and video recognition tasks in autonomous driving with diverse
input and output structures. To enable such an investigation, we design a new
challenge, Video Task Decathlon (VTD), which includes ten representative image
and video tasks spanning classification, segmentation, localization, and
association of objects and pixels. On VTD, we develop our unified network,
VTDNet, that uses a single structure and a single set of weights for all ten
tasks. VTDNet groups similar tasks and employs task interaction stages to
exchange information within and between task groups. Given the impracticality
of labeling all tasks on all frames, and the performance degradation associated
with joint training of many tasks, we design a Curriculum training,
Pseudo-labeling, and Fine-tuning (CPF) scheme to successfully train VTDNet on
all tasks and mitigate performance loss. Armed with CPF, VTDNet significantly
outperforms its single-task counterparts on most tasks at only 20% of the
overall computation. VTD is a promising new direction for exploring the unification of
perception tasks in autonomous driving.
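The CPF scheme is described only at a high level in the abstract, but its three stages are procedural enough to sketch. Below is a minimal, hypothetical PyTorch outline of such a curriculum / pseudo-labeling / fine-tuning loop over a grouped multi-task network; the toy model, task names, and staging are illustrative assumptions, not the authors' released VTDNet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical task grouping in the spirit of the abstract's four
# categories; the real VTD task list differs.
TASK_GROUPS = {
    "classification": ["tagging"],
    "segmentation": ["semantic_seg"],
    "localization": ["detection"],
    "association": ["tracking"],
}
ALL_TASKS = [t for group in TASK_GROUPS.values() for t in group]

class ToyMultiTaskNet(nn.Module):
    """Single backbone, one head per task (a stand-in for VTDNet)."""
    def __init__(self, in_dim: int = 32, feat_dim: int = 64, n_classes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.heads = nn.ModuleDict({t: nn.Linear(feat_dim, n_classes) for t in ALL_TASKS})

    def forward(self, x, tasks):
        feat = self.backbone(x)
        return {t: self.heads[t](feat) for t in tasks}

def cpf_train(model, batches, stages, lr=1e-4):
    """Run the three CPF stages in order over the same data.

    `stages` is an ordered mapping of stage name -> tasks trained in it.
    Tasks without annotations are skipped during the curriculum stage and
    supervised with the model's own pseudo-labels afterwards (one common
    choice; the paper's exact pseudo-labeling recipe may differ).
    """
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for stage, tasks in stages.items():
        for x, labels in batches:
            preds = model(x, tasks)
            losses = []
            for t in tasks:
                if t in labels:
                    target = labels[t]            # ground-truth annotation
                elif stage == "curriculum":
                    continue                      # curriculum: labeled tasks only
                else:
                    with torch.no_grad():         # pseudo-label missing tasks
                        target = model(x, [t])[t].argmax(dim=-1)
                losses.append(F.cross_entropy(preds[t], target))
            if not losses:
                continue
            loss = torch.stack(losses).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()

# Tiny smoke test: one batch, detection labeled, the rest pseudo-labeled.
model = ToyMultiTaskNet()
batch = (torch.randn(4, 32), {"detection": torch.randint(0, 10, (4,))})
stages = {
    "curriculum": ["detection"],                   # start from well-labeled tasks
    "pseudo_labeling": ["detection", "tracking"],  # grow the task set
    "fine_tuning": ALL_TASKS,                      # joint fine-tune on everything
}
cpf_train(model, [batch], stages)
```

The staging dict makes the curriculum explicit: early stages train only the annotated tasks, later stages add the rest via model-generated pseudo-labels before a joint fine-tune.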
Related papers
- UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization [83.89550658314741]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL).
We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z)
- CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation [31.167405688707575]
We propose a framework for instance-level visual analysis on video frames.
It can simultaneously conduct object detection, instance segmentation, and multi-object tracking.
We evaluate the proposed method extensively on KITTI MOTS and MOTS Challenge datasets.
arXiv Detail & Related papers (2023-11-02T04:32:24Z)
- Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving [100.3848723827869]
We present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting.
Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories.
We bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving.
arXiv Detail & Related papers (2023-03-03T08:54:06Z)
- Egocentric Video Task Translation [109.30649877677257]
We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once.
Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. A conceptual sketch of this flipped design appears after this list.
arXiv Detail & Related papers (2022-12-13T00:47:13Z)
- A Unified Sequence Interface for Vision Tasks [87.328893553186]
We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
arXiv Detail & Related papers (2022-06-15T17:08:53Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Generative Modeling for Multi-task Visual Learning [40.96212750592383]
We consider a novel problem of learning a shared generative model that is useful across various visual perception tasks.
We propose a general multi-task oriented generative modeling framework, by coupling a discriminative multi-task network with a generative network.
Our framework consistently outperforms state-of-the-art multi-task approaches.
arXiv Detail & Related papers (2021-06-25T03:42:59Z)
- NeurAll: Towards a Unified Visual Perception Model for Automated Driving [8.49826472556323]
We propose a joint multi-task network design for learning several tasks simultaneously.
The main bottleneck in automated driving systems is the limited processing power available on deployment hardware.
arXiv Detail & Related papers (2019-02-10T12:45:49Z)
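For the EgoT2 entry above, the flipped design (separate task-specific backbones plus one shared task translator) is concrete enough to illustrate. The following is a minimal conceptual sketch under assumed shapes; the module names, dimensions, and translator choice are hypothetical and do not reflect EgoT2's actual implementation.

```python
import torch
import torch.nn as nn

class EgoT2Sketch(nn.Module):
    """Flipped design: frozen per-task backbones feed one shared
    task translator. All shapes and names here are hypothetical."""
    def __init__(self, n_tasks: int = 3, dim: int = 32, n_classes: int = 10):
        super().__init__()
        # One (stand-in) pretrained backbone per task, kept frozen.
        self.backbones = nn.ModuleList(nn.Linear(16, dim) for _ in range(n_tasks))
        for p in self.backbones.parameters():
            p.requires_grad_(False)
        # A single translator shared across all tasks mixes their outputs.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers=1)
        self.readout = nn.Linear(dim, n_classes)

    def forward(self, x, target_task: int):
        # Treat each backbone's output as one token, so the translator
        # can exploit cross-task synergies when refining any one task.
        tokens = torch.stack([b(x) for b in self.backbones], dim=1)
        fused = self.translator(tokens)           # (batch, n_tasks, dim)
        return self.readout(fused[:, target_task])

model = EgoT2Sketch()
logits = model(torch.randn(4, 16), target_task=1)  # -> shape (4, 10)
```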
This list is automatically generated from the titles and abstracts of the papers on this site.