Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
- URL: http://arxiv.org/abs/2411.01494v1
- Date: Sun, 03 Nov 2024 09:21:17 GMT
- Title: Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
- Authors: Seongsu Ha, Chaeyun Kim, Donghwa Kim, Junho Lee, Sangho Lee, Joonseok Lee,
- Abstract summary: Recent RIS models still show a significant performance gap between easy and hard scenarios.
We propose a powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo)
NeMo augments a training image into a mosaic with three other negative images carefully curated by a pretrained multimodal alignment model.
- Score: 18.429833114307513
- License:
- Abstract: Referring Image Segmentation is a comprehensive task to segment an object referred by a textual query from an image. In nature, the level of difficulty in this task is affected by the existence of similar objects and the complexity of the referring expression. Recent RIS models still show a significant performance gap between easy and hard scenarios. We pose that the bottleneck exists in the data, and propose a simple but powerful data augmentation method, Negative-mined Mosaic Augmentation (NeMo). This method augments a training image into a mosaic with three other negative images carefully curated by a pretrained multimodal alignment model, e.g., CLIP, to make the sample more challenging. We discover that it is critical to properly adjust the difficulty level, neither too ambiguous nor too trivial. The augmented training data encourages the RIS model to recognize subtle differences and relationships between similar visual entities and to concretely understand the whole expression to locate the right target better. Our approach shows consistent improvements on various datasets and models, verified by extensive experiments.
Related papers
- UnSeg: One Universal Unlearnable Example Generator is Enough against All Image Segmentation [64.01742988773745]
An increasing privacy concern exists regarding training large-scale image segmentation models on unauthorized private data.
We exploit the concept of unlearnable examples to make images unusable to model training by generating and adding unlearnable noise into the original images.
We empirically verify the effectiveness of UnSeg across 6 mainstream image segmentation tasks, 10 widely used datasets, and 7 different network architectures.
arXiv Detail & Related papers (2024-10-13T16:34:46Z) - Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes [4.418515380386838]
Mosaic data augmentation technique stitches multiple images together to increase the diversity and complexity of training data.
This paper proposes the Select-Mosaic data augmentation method, which is improved with a fine-grained region selection strategy.
The improved Select-Mosaic method demonstrates superior performance in handling dense small object detection tasks.
arXiv Detail & Related papers (2024-06-08T09:22:08Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - A Multi-Task Cross-Task Learning Architecture for Ad-hoc Uncertainty
Estimation in 3D Cardiac MRI Image Segmentation [0.0]
We present a Multi-task Cross-task learning consistency approach to enforce the correlation between the pixel-level (segmentation) and the geometric-level (distance map) tasks.
Our study further showcases the potential of our model to flag low-quality segmentation from a given model.
arXiv Detail & Related papers (2021-09-16T03:53:24Z) - Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z) - Hard Negative Mixing for Contrastive Learning [29.91220669060252]
We argue that an important aspect of contrastive learning, i.e., the effect of hard negatives, has so far been neglected.
We propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead.
arXiv Detail & Related papers (2020-10-02T14:34:58Z) - Importance of Self-Consistency in Active Learning for Semantic
Segmentation [31.392212891018655]
We show that self-consistency can be a powerful source of self-supervision to improve the performance of a data-driven model with access to only a small amount of labeled data.
In our proposed active learning framework, we iteratively extract small image patches that need to be labeled.
We are able to find the image patches over which the current model struggles the most to classify.
arXiv Detail & Related papers (2020-08-04T22:18:35Z) - Adversarial Semantic Data Augmentation for Human Pose Estimation [96.75411357541438]
We propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity.
We also propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamiclly predict tailored pasting configuration.
State-of-the-art results are achieved on challenging benchmarks.
arXiv Detail & Related papers (2020-08-03T07:56:04Z) - Complementing Representation Deficiency in Few-shot Image
Classification: A Meta-Learning Approach [27.350615059290348]
We propose a meta-learning approach with complemented representations network (MCRNet) for few-shot image classification.
In particular, we embed a latent space, where latent codes are reconstructed with extra representation information to complement the representation deficiency.
Our end-to-end framework achieves the state-of-the-art performance in image classification on three standard few-shot learning datasets.
arXiv Detail & Related papers (2020-07-21T13:25:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.