Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation
- URL: http://arxiv.org/abs/2307.07812v1
- Date: Sat, 15 Jul 2023 14:21:58 GMT
- Title: Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation
- Authors: Mennatullah Siam, Rezaul Karim, He Zhao, Richard Wildes
- Abstract summary: We present a meta-learned Multiscale Memory Comparator (MMC) for few-shot video segmentation.
Unlike previous work, we instead preserve the detailed feature maps during cross-scale information exchange.
Our approach outperforms the baseline and yields state-of-the-art performance.
- Score: 8.16038976420041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot video segmentation is the task of delineating a specific novel class
in a query video using few labelled support images. Typical approaches compare
support and query features while limiting comparisons to a single feature layer
and thereby ignore potentially valuable information. We present a meta-learned
Multiscale Memory Comparator (MMC) for few-shot video segmentation that
combines information across scales within a transformer decoder. Typical
multiscale transformer decoders for segmentation tasks learn a compressed
representation, their queries, through information exchange across scales.
Unlike previous work, we instead preserve the detailed feature maps during
cross-scale information exchange via multiscale memory transformer decoding
to reduce confusion between the background and the novel class. Integral to the
approach, we investigate multiple forms of information exchange across scales
in different tasks and provide insights with empirical evidence on which to use
in each task. The overall comparisons between query and support features benefit
from both rich semantics and precise localization. We demonstrate our approach
primarily on few-shot video object segmentation and an adapted version on the
fully supervised counterpart. In all cases, our approach outperforms the
baseline and yields state-of-the-art performance. Our code is publicly
available at https://github.com/MSiam/MMC-MultiscaleMemory.
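As a rough, non-authoritative illustration of the idea in the abstract, the sketch below cross-attends flattened query-video features to masked support features at several scales and keeps a full-resolution feature map at every scale, rather than compressing the exchange into a small set of object queries. It is a minimal PyTorch sketch under stated assumptions; the module names (MemoryComparatorLayer, MultiscaleMemoryComparator) and the coarse-to-fine injection scheme are illustrative choices, not the authors' implementation (see the repository above for that).
```python
# Hypothetical sketch of cross-scale support-query comparison with a
# memory-style transformer decoder. Names are illustrative only and are
# not taken from the MMC repository.
import torch
import torch.nn as nn


class MemoryComparatorLayer(nn.Module):
    """Cross-attends query-video tokens to support tokens at one scale,
    returning a full token map rather than compressed object queries."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats, support_feats):
        # query_feats: (B, N_q, C) flattened query-frame tokens
        # support_feats: (B, N_s, C) flattened support tokens ("memory")
        attn_out, _ = self.cross_attn(query_feats, support_feats, support_feats)
        x = self.norm1(query_feats + attn_out)
        return self.norm2(x + self.ffn(x))


class MultiscaleMemoryComparator(nn.Module):
    """Runs the comparator at several scales and propagates the decoded map
    coarse-to-fine, so detailed feature maps survive the cross-scale exchange."""

    def __init__(self, dim: int, num_scales: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(MemoryComparatorLayer(dim) for _ in range(num_scales))

    def forward(self, query_pyramid, support_pyramid):
        # Both pyramids are lists ordered coarse -> fine: tensors of shape (B, C, H_i, W_i).
        decoded, prev = [], None
        for layer, q, s in zip(self.layers, query_pyramid, support_pyramid):
            B, C, H, W = q.shape
            q_tok = q.flatten(2).transpose(1, 2)   # (B, H*W, C)
            s_tok = s.flatten(2).transpose(1, 2)
            if prev is not None:                   # inject the coarser decoded map
                prev_up = nn.functional.interpolate(prev, size=(H, W),
                                                    mode="bilinear", align_corners=False)
                q_tok = q_tok + prev_up.flatten(2).transpose(1, 2)
            out = layer(q_tok, s_tok).transpose(1, 2).reshape(B, C, H, W)
            decoded.append(out)
            prev = out
        return decoded                             # full feature maps at every scale


if __name__ == "__main__":
    dim = 64
    qs = [torch.randn(1, dim, s, s) for s in (8, 16, 32)]  # query pyramid, coarse -> fine
    ss = [torch.randn(1, dim, s, s) for s in (8, 16, 32)]  # support pyramid
    maps = MultiscaleMemoryComparator(dim)(qs, ss)
    print([m.shape for m in maps])
```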
Related papers
- Multi-scale Feature Enhancement in Multi-task Learning for Medical Image Analysis [1.6916040234975798]
Traditional deep learning methods in medical imaging often focus solely on segmentation or classification.
We propose a simple yet effective UNet-based MTL model, where features extracted by the encoder are used to predict classification labels, while the decoder produces the segmentation mask.
Experimental results across multiple medical datasets confirm the superior performance of our model in both segmentation and classification tasks.
arXiv Detail & Related papers (2024-11-30T04:20:05Z)
- OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation [6.053853367809978]
Existing few-shot segmentation methods are based on the meta-learning strategy and extract instance knowledge from a support set.
We propose a multi-information aggregation network (MIANet) that effectively leverages the general knowledge, i.e., semantic word embeddings, and instance information for accurate segmentation.
Experiments on PASCAL-5i and COCO-20i show that MIANet yields superior performance and sets a new state of the art.
arXiv Detail & Related papers (2023-05-23T09:36:27Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from tiny (5M) to base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z)
- Learning Meta-class Memory for Few-Shot Semantic Segmentation [90.28474742651422]
We introduce the concept of meta-class, which is the meta information shareable among all classes.
We propose a novel Meta-class Memory based few-shot segmentation method (MM-Net), where we introduce a set of learnable memory embeddings.
Our proposed MM-Net achieves 37.5% mIoU on the COCO dataset in 1-shot setting, which is 5.1% higher than the previous state-of-the-art.
arXiv Detail & Related papers (2021-08-06T06:29:59Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)