Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object
Detection and Segmentation
- URL: http://arxiv.org/abs/2106.11401v1
- Date: Mon, 21 Jun 2021 20:30:44 GMT
- Title: Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object
Detection and Segmentation
- Authors: Eslam Mohamed and Ahmed El-Sallab
- Abstract summary: We present a Multi-Task Learning architecture, based on Transformers, to jointly perform both tasks through one network.
We evaluate the performance of the individual-task architectures versus the MTL setup, both with early shared encoders and with late shared encoder-decoder transformers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Moving objects have special importance for Autonomous Driving tasks.
Detecting moving objects can be posed as Moving Object Segmentation, by
segmenting the object pixels, or Moving Object Detection, by generating a
bounding box for the moving targets. In this paper, we present a Multi-Task
Learning architecture, based on Transformers, to jointly perform both tasks
through one network. Due to the importance of the motion features to the task,
the whole setup is based on Spatio-Temporal aggregation. We evaluate the
performance of the individual-task architectures versus the MTL setup, both
with early shared encoders and with late shared encoder-decoder transformers.
For the latter, we present a novel joint-tasks query decoder transformer that
enables task-dedicated heads on top of the shared model. To evaluate our
approach, we use the KITTI MOD [29] dataset. Results show a 1.5% mAP
improvement for Moving Object Detection, and a 2% IoU improvement for Moving
Object Segmentation, over the individual-task networks.
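As a rough sketch of the late-shared setup described above, the PyTorch snippet below runs spatio-temporal tokens through a shared encoder, decodes a joint set of learned task queries in one shared decoder, and reads task-dedicated detection and segmentation heads off their own query slices. All module sizes, query counts, and head designs are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of a late-shared encoder-decoder with a joint task-query
# decoder and task-dedicated heads. Sizes and head designs are assumed.
import torch
import torch.nn as nn

class JointTaskQueryDecoder(nn.Module):
    def __init__(self, d_model=256, n_det_queries=100, n_seg_queries=100):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        # One learned query set per task, decoded jointly through the
        # shared decoder so the two tasks can exchange information.
        self.det_queries = nn.Parameter(torch.randn(n_det_queries, d_model))
        self.seg_queries = nn.Parameter(torch.randn(n_seg_queries, d_model))
        self.box_head = nn.Linear(d_model, 4)         # (cx, cy, w, h) per query
        self.mask_head = nn.Linear(d_model, 32 * 32)  # coarse mask logits

    def forward(self, spatio_temporal_tokens):
        # spatio_temporal_tokens: (B, N, d_model) features aggregated over
        # a clip of frames (standing in for the spatio-temporal aggregation).
        memory = self.encoder(spatio_temporal_tokens)
        B = memory.size(0)
        queries = torch.cat([self.det_queries, self.seg_queries], dim=0)
        queries = queries.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(queries, memory)
        n_det = self.det_queries.size(0)
        boxes = self.box_head(decoded[:, :n_det])   # Moving Object Detection
        masks = self.mask_head(decoded[:, n_det:])  # Moving Object Segmentation
        return boxes, masks

model = JointTaskQueryDecoder()
tokens = torch.randn(2, 196, 256)  # e.g. 14x14 patch tokens per clip
boxes, masks = model(tokens)
print(boxes.shape, masks.shape)  # (2, 100, 4) and (2, 100, 1024)
```

The joint query set is what lets one shared decoder serve both tasks while keeping the output heads separate.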
Related papers
- Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation [28.16053631036079]
Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects in a video.
We introduce a compact Transformer-based method, termed TenRMOT, to exploit the advantages of the Transformer architecture.
TenRMOT demonstrates superior performance on both the referring multi-object tracking and the segmentation tasks.
arXiv Detail & Related papers (2024-10-17T11:07:05Z)
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state-of-the-art on the Argoverse 2 Sensor and Open datasets.
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
- DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding [7.470587868134298]
Point scene understanding is a challenging task that processes real-world scene point clouds.
The recent state-of-the-art method first segments each object and then processes each independently, with multiple stages for the different sub-tasks.
We propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation.
arXiv Detail & Related papers (2024-03-25T05:22:34Z)
- A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection [33.30644598646274]
We propose a simple yet effective network (SENet) based on vision Transformer (ViT).
To enhance the Transformer's ability to model local information, we propose a local information capture module (LICM).
We also propose a dynamic weighted loss (DW loss) based on Binary Cross-Entropy (BCE) and Intersection over Union (IoU) loss, which guides the network to pay more attention to those smaller and more difficult-to-find target objects.
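The DW loss description lends itself to a small sketch: below is one plausible way to combine BCE and soft-IoU terms and up-weight samples with small foreground regions, so smaller, harder targets dominate the gradient. The inverse-foreground-ratio weighting rule is an assumption, not the paper's exact formulation.

```python
# Hedged sketch of a dynamically weighted BCE + IoU segmentation loss.
# The per-sample weighting rule is assumed, not taken from the paper.
import torch
import torch.nn.functional as F

def dw_loss(logits, target, eps=1e-6):
    # logits, target: (B, 1, H, W); target in {0, 1}
    prob = torch.sigmoid(logits)
    # Soft IoU loss, computed per sample.
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    iou_loss = 1.0 - (inter + eps) / (union + eps)
    # Pixel-wise BCE, averaged per sample.
    bce = F.binary_cross_entropy_with_logits(
        logits, target, reduction='none').mean(dim=(1, 2, 3))
    # Dynamic weight: inverse foreground ratio, clipped for stability,
    # so small objects contribute more to the total loss.
    fg_ratio = target.mean(dim=(1, 2, 3)).clamp(min=eps)
    weight = (1.0 / fg_ratio).clamp(max=10.0)
    return (weight * (bce + iou_loss)).mean()

loss = dw_loss(torch.randn(4, 1, 64, 64),
               (torch.rand(4, 1, 64, 64) > 0.9).float())
print(loss.item())
```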
arXiv Detail & Related papers (2024-02-29T07:29:28Z)
- ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation [62.98078087018469]
We introduce MSDeAOT, a variant of the AOT framework that incorporates transformers at multiple feature scales.
MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16.
We also employ GPM at a finer feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects.
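A minimal sketch of the two-scale propagation idea: match current-frame features against previous-frame features to carry the mask forward at stride 16, then repeat at stride 8 on the upsampled result. Plain cross-attention stands in for the framework's GPM here, purely for illustration.

```python
# Illustrative two-scale mask propagation via feature matching.
# The attention-based matching is an assumed stand-in for GPM.
import torch
import torch.nn.functional as F

def propagate_mask(feat_prev, feat_curr, mask_prev):
    # feat_*: (B, C, H, W) features at one stride; mask_prev: (B, 1, H, W)
    B, C, H, W = feat_curr.shape
    q = feat_curr.flatten(2).transpose(1, 2)  # (B, HW, C) current frame
    k = feat_prev.flatten(2).transpose(1, 2)  # (B, HW, C) previous frame
    v = mask_prev.flatten(2).transpose(1, 2)  # (B, HW, 1) previous mask
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, 1, H, W)

# Coarse propagation at stride 16, then refinement at stride 8.
f16_prev, f16_curr = torch.randn(1, 64, 24, 24), torch.randn(1, 64, 24, 24)
f8_prev, f8_curr = torch.randn(1, 64, 48, 48), torch.randn(1, 64, 48, 48)
m16 = propagate_mask(f16_prev, f16_curr, torch.rand(1, 1, 24, 24))
m8 = propagate_mask(f8_prev, f8_curr, F.interpolate(m16, scale_factor=2))
print(m8.shape)  # torch.Size([1, 1, 48, 48])
```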
arXiv Detail & Related papers (2023-07-05T03:43:15Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
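A hedged sketch of that pattern: a shared token representation feeds one task-specific transformer decoder head per task, each with its own learned queries. The task names, query counts, and sizes are assumptions, not MulT's actual configuration.

```python
# Sketch of a shared representation fanned out to per-task decoder heads.
# Task names and all dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, d_model=128, tasks=('depth', 'segmentation')):
        super().__init__()
        make_layer = lambda: nn.TransformerDecoderLayer(
            d_model, nhead=4, batch_first=True)
        # One transformer decoder head and one learned query set per task.
        self.heads = nn.ModuleDict(
            {t: nn.TransformerDecoder(make_layer(), num_layers=2)
             for t in tasks})
        self.queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(16, d_model)) for t in tasks})

    def forward(self, shared_tokens):
        # shared_tokens: (B, N, d_model) from a shared encoder
        B = shared_tokens.size(0)
        return {t: head(self.queries[t].unsqueeze(0).expand(B, -1, -1),
                        shared_tokens)
                for t, head in self.heads.items()}

out = MultiTaskHeads()(torch.randn(2, 49, 128))
print({t: o.shape for t, o in out.items()})
```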
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Semantics-Guided Moving Object Segmentation with 3D LiDAR [32.84782551737681]
Moving object segmentation (MOS) is the task of distinguishing moving objects from the surrounding static environment.
We propose a semantics-guided convolutional neural network for moving object segmentation.
arXiv Detail & Related papers (2022-05-06T12:59:54Z)
- Associating Objects with Transformers for Video Object Segmentation [74.51719591192787]
We propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly.
AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space.
We ranked 1st in the 3rd Large-scale Video Object Segmentation Challenge.
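The identification mechanism can be illustrated with a small sketch: assign each object one vector from a learned ID bank and scatter those vectors onto the object's pixels, producing a single embedding map that encodes all targets at once. The sizes and the scatter-by-weighted-sum encoding are assumptions based on the one-line summary above.

```python
# Sketch of identification embeddings for encoding multiple objects in
# one shared embedding space. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class IDAssigner(nn.Module):
    def __init__(self, num_ids=10, dim=64):
        super().__init__()
        self.id_bank = nn.Embedding(num_ids, dim)  # one vector per identity

    def encode(self, masks):
        # masks: (B, N, H, W) binary masks for N objects (N <= num_ids).
        B, N, H, W = masks.shape
        ids = self.id_bank.weight[:N]                     # (N, dim)
        # Weighted sum scatters each object's ID vector onto its pixels,
        # so all objects share a single (B, dim, H, W) embedding map.
        return torch.einsum('bnhw,nd->bdhw', masks, ids)

assigner = IDAssigner()
masks = (torch.rand(2, 3, 32, 32) > 0.5).float()
id_map = assigner.encode(masks)
print(id_map.shape)  # torch.Size([2, 64, 32, 32])
```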
arXiv Detail & Related papers (2021-06-04T17:59:57Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark, without complicated bells and whistles, in both speed and accuracy: 0.14 seconds per frame and a J&F measure of 75.9%, respectively.
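A toy sketch of a time-evolving template: keep a pooled object descriptor as an exponential moving average over frames and match it against the current feature map by cosine similarity. The EMA update is an assumed stand-in for the paper's dynamic template mechanism.

```python
# Illustrative time-evolving template matching. The EMA update rule and
# all sizes are assumptions, not the paper's exact mechanism.
import torch
import torch.nn.functional as F

def update_template(template, obj_feat, momentum=0.9):
    # template, obj_feat: (C,) pooled object descriptors
    return momentum * template + (1 - momentum) * obj_feat

def match(template, feat_map):
    # template: (C,), feat_map: (C, H, W) -> cosine similarity map (H, W)
    t = F.normalize(template, dim=0)
    f = F.normalize(feat_map, dim=0)
    return torch.einsum('c,chw->hw', t, f)

template = torch.randn(64)
for _ in range(3):                     # simulate a few frames
    frame_feat = torch.randn(64, 32, 32)
    sim = match(template, frame_feat)  # where the object likely is
    obj_feat = frame_feat.flatten(1).mean(dim=1)
    template = update_template(template, obj_feat)
print(sim.shape)  # torch.Size([32, 32])
```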
arXiv Detail & Related papers (2020-07-11T05:44:16Z)