MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
- URL: http://arxiv.org/abs/2411.19067v1
- Date: Thu, 28 Nov 2024 11:27:56 GMT
- Title: MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
- Authors: Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim
- Abstract summary: Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions.
We propose a novel training framework called Masked Referring Image Segmentation (MaskRIS).
MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy.
- Score: 38.3201448852059
- License:
- Abstract: Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that the conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
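The abstract mentions simple random masking of both the image and the text. The following is only a rough sketch of that idea and is not taken from the MaskRIS code base: the function names, the patch size, the mask ratios, the fill value, and the `[MASK]` token are all assumptions made for illustration, and Distortion-aware Contextual Learning (DCL) is not covered here.

```python
import random


def mask_image_patches(image, patch_size=2, mask_ratio=0.5, fill=0, rng=None):
    """Return a copy of `image` (a list of pixel rows) with random patches set to `fill`.

    Hypothetical helper: patch_size, mask_ratio and fill are illustrative defaults.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Top-left corners of a non-overlapping patch grid.
    corners = [(r, c)
               for r in range(0, height, patch_size)
               for c in range(0, width, patch_size)]
    num_masked = int(len(corners) * mask_ratio)
    masked = [list(row) for row in image]  # copy so the input stays untouched
    for r0, c0 in rng.sample(corners, num_masked):
        for r in range(r0, min(r0 + patch_size, height)):
            for c in range(c0, min(c0 + patch_size, width)):
                masked[r][c] = fill
    return masked


def mask_text_tokens(text, mask_ratio=0.25, mask_token="[MASK]", rng=None):
    """Replace a random subset of whitespace-separated words with `mask_token`.

    Hypothetical helper: a real pipeline would mask tokenizer output instead.
    """
    rng = rng or random.Random()
    tokens = text.split()
    num_masked = int(len(tokens) * mask_ratio)
    for i in rng.sample(range(len(tokens)), num_masked):
        tokens[i] = mask_token
    return " ".join(tokens)


# Usage on a toy 4x4 "image" and a made-up referring expression.
rng = random.Random(0)
image = [[1] * 4 for _ in range(4)]
masked_image = mask_image_patches(image, patch_size=2, mask_ratio=0.5, rng=rng)
masked_text = mask_text_tokens("the man in the red shirt on the left",
                               mask_ratio=0.25, rng=rng)
```

In this sketch the segmentation target would stay unchanged while the inputs are masked, so a model has to rely on the remaining context; how MaskRIS actually combines the masked and unmasked inputs is described in the paper and the linked repository.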
Related papers
- MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation [109.19165503929992]
Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models.
We present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks.
We achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets.
arXiv Detail & Related papers (2024-12-16T05:44:45Z)
- ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
arXiv Detail & Related papers (2024-07-17T22:04:00Z)
- Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation [13.924553294859315]
Point PrompTing (PPT) is a point generator that harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability.
PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU.
arXiv Detail & Related papers (2024-04-18T08:46:12Z)
- Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval [19.61947785487129]
Mask for Semantics Completion (MASCOT) is a framework based on semantic-based masked modeling.
MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Improving self-supervised representation learning via sequential adversarial masking [12.176299580413097]
Masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision.
We propose a new framework that generates masks in a sequential fashion with different constraints on the adversary.
arXiv Detail & Related papers (2022-12-16T04:25:43Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- Masked Autoencoders are Robust Data Augmentors [90.34825840657774]
Regularization techniques like image augmentation are necessary for deep neural networks to generalize well.
We propose a novel perspective of augmentation to regularize the training process.
We show that utilizing such model-based nonlinear transformation as data augmentation can improve high-level recognition tasks.
arXiv Detail & Related papers (2022-06-10T02:41:48Z)
- OLED: One-Class Learned Encoder-Decoder Network with Adversarial Context Masking for Novelty Detection [1.933681537640272]
Novelty detection is the task of recognizing samples that do not belong to the distribution of the target class.
Deep autoencoders have been widely used as a base of many unsupervised novelty detection methods.
We have designed a framework consisting of two competing networks, a Mask Module and a Reconstructor.
arXiv Detail & Related papers (2021-03-27T17:59:40Z)