Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models
- URL: http://arxiv.org/abs/2109.01316v1
- Date: Fri, 3 Sep 2021 05:20:08 GMT
- Title: Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models
- Authors: Zixuan Chen, Junhong Zou, Xiaotao Wang
- Abstract summary: This report introduces the solutions of team 'BetterThing' for the ICCV2021 - Video Scene Parsing in the Wild Challenge.
Transformer is used as the backbone for extracting video frame features, and the final result is the aggregation of the outputs of two Transformer models, SWIN and VOLO.
This solution achieves 57.3% mIoU, ranking 3rd in the Video Scene Parsing in the Wild Challenge.
- Score: 10.478712332545854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation is an important task in computer vision that
underpins applications such as autonomous driving and scene parsing. Given the
growing emphasis on video semantic segmentation, we participated in this
competition. In this report, we briefly introduce the solutions of team
'BetterThing' for the ICCV2021 - Video Scene Parsing in the Wild Challenge. A
Transformer is used as the backbone for extracting video frame features, and
the final result is the aggregation of the outputs of two Transformer models,
SWIN and VOLO. This solution achieves 57.3% mIoU, ranking 3rd in the Video
Scene Parsing in the Wild Challenge.
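The report itself includes no code, but the fusion step is simple to illustrate. Below is a minimal PyTorch sketch of aggregating the outputs of two segmentation models by averaging their per-pixel class probabilities; the function name, tensor shapes, and the probability-averaging rule are illustrative assumptions, not the authors' exact recipe.

```python
import torch

def aggregate_predictions(logits_swin: torch.Tensor,
                          logits_volo: torch.Tensor) -> torch.Tensor:
    """Fuse per-pixel predictions from two segmentation models.

    Both inputs are assumed to be raw logits of shape
    (batch, num_classes, height, width). If the two models predict at
    different resolutions, resize one set of logits first
    (e.g., with torch.nn.functional.interpolate).
    """
    probs = (logits_swin.softmax(dim=1) + logits_volo.softmax(dim=1)) / 2.0
    return probs.argmax(dim=1)  # (batch, height, width) class id per pixel
```

Averaging probabilities rather than raw logits keeps the two models on a comparable scale; weighted averages are a common variant when one model is clearly stronger.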
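For reference, mIoU (mean Intersection over Union) is the per-class IoU between predicted and ground-truth label maps, averaged over classes. A minimal NumPy sketch is shown below, assuming integer class-id arrays; note that benchmark evaluation typically accumulates intersection and union counts over the whole dataset rather than per image.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU between two integer label maps of the same shape.

    Classes absent from both prediction and ground truth are skipped
    so they do not drag the average toward zero.
    """
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(intersection / union)
    return float(np.mean(ious))
```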
Related papers
- Point Transformer V3 Extreme: 1st Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation [98.11452697097539]
In this technical report, we detail our first-place solution for the semantic segmentation track of the 2024 Waymo Open Dataset Challenge.
We significantly enhanced the performance of Point Transformer V3 on the benchmark by implementing cutting-edge, plug-and-play training and inference techniques.
This approach secured us the top position on the Waymo Open Dataset segmentation leaderboard, markedly outperforming other entries.
arXiv Detail & Related papers (2024-07-21T22:08:52Z)
- 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [8.20168024462357]
Motion Expression guided Video Segmentation is a challenging task that aims at segmenting objects in a video based on natural language expressions with motion descriptions.
We introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement.
Our method achieved a score of 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing 2nd place in the MeViS Track at the CVPR 2024 PVUW Challenge.
arXiv Detail & Related papers (2024-06-20T02:16:23Z)
- 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [81.50620771207329]
We investigate the effectiveness of static-dominant data and frame sampling on referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z)
- 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation [12.274092278786966]
Video Panoptic Segmentation (VPS) aims to simultaneously classify, track, and segment all objects in a video.
We propose a robust integrated video panoptic segmentation solution.
Our method achieves state-of-the-art performance with VPQ scores of 56.36 and 57.12 in the development and test phases, respectively.
arXiv Detail & Related papers (2024-06-01T17:03:16Z)
- 3rd Place Solution for PVUW Challenge 2023: Video Panoptic Segmentation [10.04177400017471]
We propose a robust integrated video panoptic segmentation solution.
In our solution, we represent both semantic and instance targets as a set of queries.
We then combine these queries with video features extracted by neural networks to predict segmentation masks.
arXiv Detail & Related papers (2023-06-11T19:44:40Z)
- 3rd Place Solution for PVUW2023 VSS Track: A Large Model for Semantic Segmentation on VSPW [68.56017675820897]
In this paper, we introduce the 3rd place solution for the PVUW2023 VSS track.
We have explored various image-level visual backbones and segmentation heads to tackle the problem of video semantic segmentation.
arXiv Detail & Related papers (2023-06-04T07:50:38Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, across all frames of a given video, the object instances referred to by a language expression.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Local-Global Context Aware Transformer for Language-Guided Video Segmentation [103.35509224722097]
We explore the task of language-guided video segmentation (LVS).
We present Locater, which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner.
To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset.
arXiv Detail & Related papers (2022-03-18T07:35:26Z)
- Memory Based Video Scene Parsing [25.452807436316167]
We introduce our solution for the 1st Video Scene Parsing in the Wild Challenge, which achieved a mIoU of 57.44 and obtained 2nd place (our team name is CharlesBLWX).
arXiv Detail & Related papers (2021-09-01T13:18:36Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite its strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)