Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation
- URL: http://arxiv.org/abs/2307.07812v1
- Date: Sat, 15 Jul 2023 14:21:58 GMT
- Title: Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation
- Authors: Mennatullah Siam, Rezaul Karim, He Zhao, Richard Wildes
- Abstract summary: We present a meta-learned Multiscale Memory Comparator (MMC) for few-shot video segmentation.
Unlike previous work, we instead preserve the detailed feature maps during cross-scale information exchange.
Our approach outperforms the baseline and yields state-of-the-art performance.
- Score: 8.16038976420041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot video segmentation is the task of delineating a specific novel class
in a query video using few labelled support images. Typical approaches compare
support and query features while limiting comparisons to a single feature layer
and thereby ignore potentially valuable information. We present a meta-learned
Multiscale Memory Comparator (MMC) for few-shot video segmentation that
combines information across scales within a transformer decoder. Typical
multiscale transformer decoders for segmentation tasks learn a compressed
representation, their queries, through information exchange across scales.
Unlike previous work, we instead preserve the detailed feature maps during
cross-scale information exchange via multiscale memory transformer decoding
to reduce confusion between the background and the novel class. Integral to the
approach, we investigate multiple forms of information exchange across scales
in different tasks and provide insights with empirical evidence on which to use
in each task. The overall comparisons between query and support features benefit
from both rich semantics and precise localization. We demonstrate our approach
primarily on few-shot video object segmentation and an adapted version on the
fully supervised counterpart. In all cases, our approach outperforms the
baseline and yields state-of-the-art performance. Our code is publicly
available at https://github.com/MSiam/MMC-MultiscaleMemory.
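As a rough, non-authoritative illustration of the idea in the abstract, the sketch below cross-attends flattened query-video features to masked support features at several scales and keeps a full-resolution feature map at every scale, rather than compressing the exchange into a small set of object queries. It is a minimal PyTorch sketch under stated assumptions; the module names (MemoryComparatorLayer, MultiscaleMemoryComparator) and the coarse-to-fine injection scheme are illustrative choices, not the authors' implementation (see the repository above for that).
```python
# Hypothetical sketch of cross-scale support-query comparison with a
# memory-style transformer decoder. Names are illustrative only and are
# not taken from the MMC repository.
import torch
import torch.nn as nn


class MemoryComparatorLayer(nn.Module):
    """Cross-attends query-video tokens to support tokens at one scale,
    returning a full token map rather than compressed object queries."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats, support_feats):
        # query_feats: (B, N_q, C) flattened query-frame tokens
        # support_feats: (B, N_s, C) flattened support tokens ("memory")
        attn_out, _ = self.cross_attn(query_feats, support_feats, support_feats)
        x = self.norm1(query_feats + attn_out)
        return self.norm2(x + self.ffn(x))


class MultiscaleMemoryComparator(nn.Module):
    """Runs the comparator at several scales and propagates the decoded map
    coarse-to-fine, so detailed feature maps survive the cross-scale exchange."""

    def __init__(self, dim: int, num_scales: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(MemoryComparatorLayer(dim) for _ in range(num_scales))

    def forward(self, query_pyramid, support_pyramid):
        # Both pyramids are lists ordered coarse -> fine: tensors of shape (B, C, H_i, W_i).
        decoded, prev = [], None
        for layer, q, s in zip(self.layers, query_pyramid, support_pyramid):
            B, C, H, W = q.shape
            q_tok = q.flatten(2).transpose(1, 2)   # (B, H*W, C)
            s_tok = s.flatten(2).transpose(1, 2)
            if prev is not None:                   # inject the coarser decoded map
                prev_up = nn.functional.interpolate(prev, size=(H, W),
                                                    mode="bilinear", align_corners=False)
                q_tok = q_tok + prev_up.flatten(2).transpose(1, 2)
            out = layer(q_tok, s_tok).transpose(1, 2).reshape(B, C, H, W)
            decoded.append(out)
            prev = out
        return decoded                             # full feature maps at every scale


if __name__ == "__main__":
    dim = 64
    qs = [torch.randn(1, dim, s, s) for s in (8, 16, 32)]  # query pyramid, coarse -> fine
    ss = [torch.randn(1, dim, s, s) for s in (8, 16, 32)]  # support pyramid
    maps = MultiscaleMemoryComparator(dim)(qs, ss)
    print([m.shape for m in maps])
```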
Related papers
- Multi-scale Feature Enhancement in Multi-task Learning for Medical Image Analysis [1.6916040234975798]
Traditional deep learning methods in medical imaging often focus solely on segmentation or classification.
We propose a simple yet effective UNet-based MTL model, where features extracted by the encoder are used to predict classification labels, while the decoder produces the segmentation mask.
Experimental results across multiple medical datasets confirm the superior performance of our model in both segmentation and classification tasks.
arXiv Detail & Related papers (2024-11-30T04:20:05Z)
- OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation [6.053853367809978]
Existing few-shot segmentation methods are based on the meta-learning strategy and extract instance knowledge from a support set.
We propose a multi-information aggregation network (MIANet) that effectively leverages the general knowledge, i.e., semantic word embeddings, and instance information for accurate segmentation.
Experiments on PASCAL-5i and COCO-20i show that MIANet yields superior performance and sets a new state of the art.
arXiv Detail & Related papers (2023-05-23T09:36:27Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from tiny (5M) to base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z)
- Learning Meta-class Memory for Few-Shot Semantic Segmentation [90.28474742651422]
We introduce the concept of meta-class, which is the meta information shareable among all classes.
We propose a novel Meta-class Memory based few-shot segmentation method (MM-Net), where we introduce a set of learnable memory embeddings.
Our proposed MM-Net achieves 37.5% mIoU on the COCO dataset in 1-shot setting, which is 5.1% higher than the previous state-of-the-art.
arXiv Detail & Related papers (2021-08-06T06:29:59Z)
- Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on par with it on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)