Towards Robust Semantic Segmentation of Accident Scenes via Multi-Source
Mixed Sampling and Meta-Learning
- URL: http://arxiv.org/abs/2203.10395v1
- Date: Sat, 19 Mar 2022 21:18:54 GMT
- Title: Towards Robust Semantic Segmentation of Accident Scenes via Multi-Source
Mixed Sampling and Meta-Learning
- Authors: Xinyu Luo, Jiaming Zhang, Kailun Yang, Alina Roitberg, Kunyu Peng,
Rainer Stiefelhagen
- Abstract summary: We propose a Multi-source Meta-learning Unsupervised Domain Adaptation framework, to improve the generalization of segmentation transformers to extreme accident scenes.
Our approach achieves a mIoU score of 46.97% on the DADA-seg benchmark, surpassing the previous state-of-the-art model by more than 7.50%.
- Score: 29.74171323437029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous vehicles utilize urban scene segmentation to understand the real
world like a human and react accordingly. Semantic segmentation of normal
scenes has experienced a remarkable rise in accuracy on conventional
benchmarks. However, a significant portion of real-life accidents features
abnormal scenes, such as those with object deformations, overturns, and
unexpected traffic behaviors. Since even a small mis-segmentation of a driving
scene can pose a serious threat to human lives, the robustness of such
models in accident scenarios is a critical factor in ensuring the safety of
intelligent transportation systems.
In this paper, we propose a Multi-source Meta-learning Unsupervised Domain
Adaptation (MMUDA) framework to improve the generalization of segmentation
transformers to extreme accident scenes. In MMUDA, we make use of Multi-Domain
Mixed Sampling to augment the images of multiple-source domains (normal scenes)
with the target data appearances (abnormal scenes). To train our model, we
intertwine a meta-learning strategy with the multi-source setting to
robustify the segmentation results. We further enhance the segmentation
backbone (SegFormer) with a HybridASPP decoder design, featuring large window
attention spatial pyramid pooling and strip pooling, to efficiently aggregate
long-range contextual dependencies. Our approach achieves a mIoU score of
46.97% on the DADA-seg benchmark, surpassing the previous state-of-the-art
model by more than 7.50%. Code will be made publicly available at
https://github.com/xinyu-laura/MMUDA.
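The Multi-Domain Mixed Sampling step pastes source-domain (normal-scene) content into target-domain (accident-scene) images so the network sees source semantics under target appearance. Below is a minimal PyTorch sketch of one plausible realization in the DACS/ClassMix style; the paper's exact mixing recipe may differ, and all function and variable names here are illustrative.

```python
import torch

def class_mix(src_img, src_lbl, tgt_img, ignore_index=255):
    """Paste a random half of the source classes onto a target image.

    A DACS/ClassMix-style approximation of Multi-Domain Mixed Sampling:
    source pixels (normal scenes) are transplanted into the target
    appearance (accident scenes). Shapes: img (3, H, W), lbl (H, W).
    """
    classes = torch.unique(src_lbl)
    classes = classes[classes != ignore_index]
    # Randomly select half of the classes present in the source label map.
    keep = classes[torch.randperm(len(classes))[: max(1, len(classes) // 2)]]
    mask = torch.isin(src_lbl, keep)  # (H, W) bool: pixels to transplant
    mixed_img = torch.where(mask.unsqueeze(0), src_img, tgt_img)
    # The target is unlabeled; non-pasted pixels would normally carry
    # pseudo-labels, so here they are simply marked as "ignore".
    mixed_lbl = torch.where(mask, src_lbl,
                            torch.full_like(src_lbl, ignore_index))
    return mixed_img, mixed_lbl
```

In the multi-source setting, a mixed batch would be built this way for each source domain against the same unlabeled target images.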
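For the meta-learning component, a common pattern in multi-source generalization (MLDG-style) is to hold one source domain out per step: adapt a copy of the model on the remaining domains, then update the shared weights with the gradient of the held-out loss. The first-order sketch below illustrates that generic recipe under stated assumptions; it is not claimed to be the paper's exact procedure.

```python
import copy
import torch

def meta_step(model, loss_fn, sources, optimizer, inner_lr=1e-3):
    """One first-order meta-update over multiple source domains.

    `sources` is a list of (images, labels) batches, one per source
    domain; the last one serves as the held-out "meta-test" domain.
    """
    held_out, meta_train = sources[-1], sources[:-1]

    # Inner loop: adapt a throwaway copy on the meta-train domains.
    fast = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for imgs, lbls in meta_train:
        inner_opt.zero_grad()
        loss_fn(fast(imgs), lbls).backward()
        inner_opt.step()

    # Outer step (first-order approximation): gradients of the held-out
    # loss w.r.t. the adapted weights are applied to the original weights.
    fast.zero_grad()
    imgs, lbls = held_out
    loss_fn(fast(imgs), lbls).backward()
    optimizer.zero_grad()
    for p, fp in zip(model.parameters(), fast.parameters()):
        if fp.grad is not None:
            p.grad = fp.grad.clone()
    optimizer.step()
```

Rotating which domain is held out across steps encourages updates that transfer across domains rather than overfitting any single source.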
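The strip pooling used in the HybridASPP decoder captures long-range context along entire rows and columns at low cost. The module below is a minimal sketch in the spirit of Hou et al.'s strip pooling; the paper combines it with large window attention spatial pyramid pooling, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPooling(nn.Module):
    """Minimal strip-pooling branch: each position is gated by context
    aggregated over its whole row and whole column."""

    def __init__(self, channels):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # -> (N, C, 1, W)
        self.conv_h = nn.Conv2d(channels, channels, (3, 1),
                                padding=(1, 0), bias=False)
        self.conv_w = nn.Conv2d(channels, channels, (1, 3),
                                padding=(0, 1), bias=False)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        n, c, h, w = x.shape
        # Row context: average the width away, convolve along height,
        # then broadcast back to the full spatial size.
        xh = F.interpolate(self.conv_h(self.pool_h(x)), size=(h, w),
                           mode="bilinear", align_corners=False)
        # Column context: average the height away, convolve along width.
        xw = F.interpolate(self.conv_w(self.pool_w(x)), size=(h, w),
                           mode="bilinear", align_corners=False)
        return x * torch.sigmoid(self.fuse(F.relu(xh + xw)))

# Example: gate a 256-channel decoder feature map.
# y = StripPooling(256)(torch.randn(2, 256, 32, 64))
```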
Related papers
- JointMotion: Joint Self-Supervision for Joint Motion Prediction [10.44846560021422]
JointMotion is a self-supervised pre-training method for joint motion prediction in self-driving vehicles.
Our method reduces the joint final displacement error of Wayformer, HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively.
arXiv Detail & Related papers (2024-03-08T17:54:38Z)
- MS-Net: A Multi-Path Sparse Model for Motion Prediction in Multi-Scenes [1.4451387915783602]
Multi-Scenes Network (aka MS-Net) is a multi-path sparse model trained by an evolutionary process.
MS-Net selectively activates a subset of its parameters during the inference stage to produce prediction results for each scene.
Our experiment results show that MS-Net outperforms existing state-of-the-art methods on well-established pedestrian motion prediction datasets.
arXiv Detail & Related papers (2024-03-01T08:32:12Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
- You Only Look at Once for Real-time and Generic Multi-Task [20.61477620156465]
A-YOLOM is an adaptive, real-time, and lightweight multi-task model.
We develop an end-to-end multi-task model with a unified and streamlined segmentation structure.
We achieve competitive results on the BDD100k dataset.
arXiv Detail & Related papers (2023-10-02T21:09:43Z)
- AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation poses unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while remaining as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose using a Transformer to mine interactions among only a few selected foreground objects.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- MCENET: Multi-Context Encoder Network for Homogeneous Agent Trajectory Prediction in Mixed Traffic [35.22312783822563]
Trajectory prediction in urban mixed-traffic zones is critical for many intelligent transportation systems.
We propose an approach named Multi-Context Encoder Network (MCENET) that is trained by encoding both past and future scene context.
At inference time, we combine the past context and motion information of the target agent with samplings of the latent variables to predict multiple realistic trajectories.
arXiv Detail & Related papers (2020-02-14T11:04:41Z)