Semantic Segmentation on VSPW Dataset through Contrastive Loss and
Multi-dataset Training Approach
- URL: http://arxiv.org/abs/2306.03508v1
- Date: Tue, 6 Jun 2023 08:53:53 GMT
- Title: Semantic Segmentation on VSPW Dataset through Contrastive Loss and
Multi-dataset Training Approach
- Authors: Min Yan, Qianxiong Ning, Qian Wang
- Abstract summary: This paper presents the winning solution of the CVPR2023 workshop for video semantic segmentation.
Our approach achieves 65.95% mIoU on the VSPW dataset, ranking 1st place in the challenge at CVPR 2023.
- Score: 7.112725255953468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video scene parsing incorporates temporal information, which can enhance the
consistency and accuracy of predictions compared to image scene parsing. The
added temporal dimension enables a more comprehensive understanding of the
scene, leading to more reliable results. This paper presents the winning
solution of the CVPR2023 workshop for video semantic segmentation, focusing on
enhancing Spatial-Temporal correlations with contrastive loss. We also explore
the influence of multi-dataset training by utilizing a label-mapping technique.
The final result is obtained by aggregating the outputs of the above two models.
Our approach achieves 65.95% mIoU on the VSPW dataset, ranking 1st place in the
VSPW challenge at CVPR 2023.
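To make the label-mapping idea for multi-dataset training concrete, the sketch below remaps dataset-specific class IDs into a shared label space before joint training; the class list and ID mappings are hypothetical placeholders, not the mapping used by the authors.

```python
# Illustrative sketch of label mapping for multi-dataset training.
# The unified class list and per-dataset mappings below are hypothetical,
# not the paper's actual mapping.
import numpy as np

# Hypothetical unified label space shared by all training datasets.
UNIFIED_CLASSES = ["road", "person", "car", "sky", "vegetation"]

# Per-dataset mapping: source class id -> unified class id (255 = ignore).
LABEL_MAPS = {
    "cityscapes": {0: 0, 11: 1, 13: 2, 10: 3, 8: 4},
    "vspw":       {3: 0, 7: 1, 21: 2, 1: 3, 14: 4},
}

def remap_mask(mask: np.ndarray, dataset: str, ignore_index: int = 255) -> np.ndarray:
    """Remap a dataset-specific segmentation mask into the unified label space."""
    lut = np.full(256, ignore_index, dtype=np.uint8)   # default: ignore
    for src_id, uni_id in LABEL_MAPS[dataset].items():
        lut[src_id] = uni_id
    return lut[mask]

if __name__ == "__main__":
    raw = np.array([[0, 11, 13], [10, 8, 99]], dtype=np.uint8)  # fake source mask
    print(remap_mask(raw, "cityscapes"))
```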
Related papers
- CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs [6.456189487006878]
We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs).
We identify the limitations of recent techniques, notably their inability to leverage meaningful inter-patch relationships.
CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization.
arXiv Detail & Related papers (2024-07-07T05:39:25Z)
- Solution for CVPR 2024 UG2+ Challenge Track on All Weather Semantic Segmentation [9.322345758563886]
We present our solution for semantic segmentation in adverse weather for the UG2+ Challenge at CVPR 2024.
We initialize the InternImage-H backbone with pre-trained weights from the large-scale joint dataset and enhance it with the state-of-the-art UperNet segmentation method.
Our proposed solution demonstrates advanced performance on the test set and achieves 3rd position in this challenge.
arXiv Detail & Related papers (2024-06-09T15:56:35Z)
- Semantic Segmentation on VSPW Dataset through Masked Video Consistency [19.851665554201407]
We present our solution for the PVUW competition, where we introduce masked video consistency (MVC) based on existing models.
MVC enforces consistency between predictions of frames where random patches are withheld.
Our approach achieves 67% mIoU performance on the VSPW dataset, ranking 2nd in the PVUW2024 VSS track.
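As an illustration only, the following sketch shows one way such a masking-based prediction-consistency term can be written; the patch size, masking ratio, and KL-based formulation are assumptions rather than the exact MVC loss.

```python
# Rough sketch of a masked-frame prediction-consistency loss.
# Patch size, masking ratio and the KL-based consistency term are assumptions,
# not the exact MVC formulation.
import torch
import torch.nn.functional as F

def random_patch_mask(x: torch.Tensor, patch: int = 16, ratio: float = 0.4) -> torch.Tensor:
    """Zero out a random subset of patch x patch regions in an image batch."""
    b, _, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=x.device) > ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return x * keep

def masked_consistency_loss(model, frame: torch.Tensor) -> torch.Tensor:
    """KL between predictions on the clean frame (no grad) and the masked frame."""
    with torch.no_grad():
        target = model(frame).softmax(dim=1)          # B x C x H x W
    logits_masked = model(random_patch_mask(frame))
    return F.kl_div(logits_masked.log_softmax(dim=1), target, reduction="batchmean")

if __name__ == "__main__":
    toy_model = torch.nn.Conv2d(3, 5, kernel_size=1)   # stand-in segmentation head
    frames = torch.randn(2, 3, 64, 64)
    print(masked_consistency_loss(toy_model, frames).item())
```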
arXiv Detail & Related papers (2024-06-07T14:41:24Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Temporal Contrastive Learning with Curriculum [19.442685015494316]
ConCur is a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy.
We conduct experiments on two popular action recognition datasets, UCF101 and HMDB51, on which our proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-09-02T00:12:05Z)
- SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery [74.82821342249039]
We present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE).
To leverage temporal information, we include a temporal embedding along with independently masking image patches across time.
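A compact sketch of the two ingredients named above, a temporal embedding added to patch tokens and masking applied independently per timestep; the token dimensions, learnable embedding, and masking ratio are illustrative assumptions rather than SatMAE's actual configuration.

```python
# Compact sketch of temporal embeddings plus per-timestep independent masking.
# Token dimensions, the learnable temporal embedding and the masking ratio are
# illustrative assumptions, not SatMAE's exact configuration.
import torch
import torch.nn as nn

class TemporalTokenMasker(nn.Module):
    def __init__(self, num_timesteps: int = 3, embed_dim: int = 128, mask_ratio: float = 0.75):
        super().__init__()
        self.temporal_embed = nn.Parameter(torch.zeros(num_timesteps, embed_dim))
        self.mask_ratio = mask_ratio

    def forward(self, tokens: torch.Tensor):
        """tokens: (B, T, N, D) patch tokens for T timesteps of the same scene."""
        b, t, n, d = tokens.shape
        tokens = tokens + self.temporal_embed[:t].view(1, t, 1, d)   # add time information
        n_keep = int(n * (1.0 - self.mask_ratio))
        # Independent random masking for every sample and every timestep.
        scores = torch.rand(b, t, n, device=tokens.device)
        keep_idx = scores.argsort(dim=-1)[..., :n_keep]              # (B, T, n_keep)
        visible = torch.gather(tokens, 2, keep_idx.unsqueeze(-1).expand(-1, -1, -1, d))
        return visible, keep_idx

if __name__ == "__main__":
    masker = TemporalTokenMasker()
    toks = torch.randn(2, 3, 196, 128)        # 2 samples, 3 timesteps, 14x14 patches
    vis, idx = masker(toks)
    print(vis.shape)                          # torch.Size([2, 3, 49, 128])
```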
arXiv Detail & Related papers (2022-07-17T01:35:29Z)
- Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing [11.848929625911575]
We propose a Spatial-Temporal Semantic Consistency method to capture class-exclusive context information.
Specifically, we design a spatial-temporal consistency loss to constrain the semantic consistency in spatial and temporal dimensions.
Our method wins the 1st place on VSPW challenge at ICCV 2021.
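The sketch below illustrates one possible class-wise spatial-temporal consistency term, pulling same-class feature centroids of consecutive frames together; it is a simplified stand-in for intuition, not the paper's exact loss.

```python
# Hedged sketch of a cross-frame, class-wise consistency term: per-class feature
# centroids of two frames are pulled together. This illustrates the general idea
# only; it is not the paper's actual spatial-temporal consistency loss.
import torch
import torch.nn.functional as F

def class_centroids(feats: torch.Tensor, labels: torch.Tensor, num_classes: int):
    """feats: (C, H, W) features, labels: (H, W) class ids -> (num_classes, C) centroids."""
    c = feats.shape[0]
    flat = feats.view(c, -1)                           # (C, H*W)
    cents, valid = [], []
    for k in range(num_classes):
        mask = labels.view(-1) == k
        has = bool(mask.any())
        valid.append(has)
        cents.append(flat[:, mask].mean(dim=1) if has else flat.new_zeros(c))
    return torch.stack(cents), torch.tensor(valid, device=feats.device)

def temporal_class_consistency(f_t, f_t1, lab_t, lab_t1, num_classes=5):
    """Cosine distance between same-class centroids of two consecutive frames."""
    c_t, v_t = class_centroids(f_t, lab_t, num_classes)
    c_t1, v_t1 = class_centroids(f_t1, lab_t1, num_classes)
    shared = v_t & v_t1                                # classes present in both frames
    if not shared.any():
        return f_t.new_zeros(())
    sim = F.cosine_similarity(c_t[shared], c_t1[shared], dim=1)
    return (1.0 - sim).mean()

if __name__ == "__main__":
    f0, f1 = torch.randn(16, 32, 32), torch.randn(16, 32, 32)
    l0, l1 = torch.randint(0, 5, (32, 32)), torch.randint(0, 5, (32, 32))
    print(temporal_class_consistency(f0, f1, l0, l1).item())
```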
arXiv Detail & Related papers (2021-09-06T08:24:38Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
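For intuition, a simplified pixel-level InfoNCE between two views is sketched below; unlike the actual DenseCL, it omits the momentum encoder and assumes the two views are spatially aligned so that corresponding rows form positive pairs.

```python
# Simplified sketch of a pixel-level contrastive loss between two views.
# Real DenseCL uses a momentum encoder and similarity-based correspondence;
# here, for brevity, the two views are assumed to be spatially aligned.
import torch
import torch.nn.functional as F

def dense_info_nce(f1: torch.Tensor, f2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """f1, f2: (N, D) per-pixel embeddings from two views of the same image,
    where row i of f1 and row i of f2 describe the same location (positive pair)."""
    f1 = F.normalize(f1, dim=1)
    f2 = F.normalize(f2, dim=1)
    logits = f1 @ f2.t() / tau                 # (N, N) pairwise similarities
    targets = torch.arange(f1.size(0), device=f1.device)
    return F.cross_entropy(logits, targets)    # diagonal entries are the positives

if __name__ == "__main__":
    n, d = 256, 128                            # 16x16 feature map flattened, 128-dim
    v1, v2 = torch.randn(n, d), torch.randn(n, d)
    print(dense_info_nce(v1, v2).item())
```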
arXiv Detail & Related papers (2020-11-18T08:42:32Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)
- Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation [57.68890534164427]
In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data.
Our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks.
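The iterative pseudo-labeling loop can be summarized as below; `train_segmenter` and `predict_masks` are hypothetical placeholders for a full training and inference pipeline, so only the loop structure mirrors the described procedure.

```python
# Minimal sketch of iterative pseudo-labeling ("teacher predicts, student trains").
# `train_segmenter` and `predict_masks` are hypothetical helpers standing in for a
# full training / inference pipeline; only the loop structure mirrors the idea above.
from typing import Callable, List, Tuple

def naive_student_loop(
    labeled: List[Tuple[str, str]],           # (image_path, mask_path) pairs
    unlabeled: List[str],                     # image/video frame paths without labels
    train_segmenter: Callable[[List[Tuple[str, str]]], object],
    predict_masks: Callable[[object, List[str]], List[str]],
    iterations: int = 3,
):
    model = train_segmenter(labeled)          # teacher trained on human annotations
    for _ in range(iterations):
        pseudo_masks = predict_masks(model, unlabeled)
        pseudo_pairs = list(zip(unlabeled, pseudo_masks))
        # Student is retrained on human-annotated plus pseudo-labeled data,
        # then becomes the teacher for the next round.
        model = train_segmenter(labeled + pseudo_pairs)
    return model
```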
arXiv Detail & Related papers (2020-05-20T18:00:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.