AMNS: Attention-Weighted Selective Mask and Noise Label Suppression for Text-to-Image Person Retrieval
- URL: http://arxiv.org/abs/2409.06385v2
- Date: Wed, 11 Sep 2024 02:45:39 GMT
- Title: AMNS: Attention-Weighted Selective Mask and Noise Label Suppression for Text-to-Image Person Retrieval
- Authors: Runqing Zhang, Xue Zhou
- Abstract summary: Under-correlated and false-correlated image-text pairs arise due to poor image quality and mislabeling.
We propose a new noise label suppression method and alleviate the problems introduced by random masking via an attention-weighted selective mask strategy.
- Score: 3.591122855617648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image person retrieval aims to retrieve images of a person given textual descriptions. Most methods implicitly assume that the training image-text pairs are correctly aligned, but in practice under-correlated and false-correlated pairs arise due to poor image quality and mislabeling. Meanwhile, the random masking augmentation strategy may incorrectly discard semantic content, generating noisy pairings between image lexical elements and text descriptions. To solve these two problems, we propose a new noise label suppression method and alleviate the problem caused by random masking through an attention-weighted selective mask strategy. The noise label suppression method curbs the effect of noise labels by preventing the model from becoming overconfident, using an inverse KL divergence loss combined with a weight-adjusted focal loss that further improves the model's recognition of difficult samples. Attention-Weighted Selective Mask, in turn, processes the raw image through the EMA version of the image encoder, retaining the tokens with strong semantic associations to the corresponding text description so that better features can be extracted. Extensive experiments validate the effectiveness of our approach in dealing with noisy correspondence. The code will be available soon at https://github.com/RunQing715/AMNS.git.
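The abstract gives no formulas, so the following are minimal PyTorch sketches, not the paper's exact formulation. First, the noise label suppression objective, assuming ITC-style similarity logits `sim_logits` with the matched pair at index `target`; the reverse-KL direction, the label smoothing used to keep it finite, and the balance weight `lambda_rkl` are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def reverse_kl_loss(logits, soft_targets, eps=1e-8):
    # KL(model || targets): unlike cross-entropy (the forward direction),
    # this penalizes confident predictions that the possibly noisy labels
    # do not support, discouraging overconfidence on mislabeled pairs.
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()
    return (p * (log_p - (soft_targets + eps).log())).sum(dim=-1).mean()

def focal_loss(logits, target, gamma=2.0):
    # (1 - p_t)^gamma down-weights easy samples so hard ones dominate.
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, target, reduction="none")
    p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_t) ** gamma * ce).mean()

def noise_suppression_loss(sim_logits, target, lambda_rkl=0.5, smoothing=0.1):
    # Label-smoothed soft targets avoid log(0) inside the reverse-KL term.
    num_classes = sim_logits.size(-1)
    soft = torch.full_like(sim_logits, smoothing / (num_classes - 1))
    soft.scatter_(1, target.unsqueeze(1), 1.0 - smoothing)
    return focal_loss(sim_logits, target) + lambda_rkl * reverse_kl_loss(sim_logits, soft)
```

Second, the selective mask. The `ema_image_encoder` interface returning per-patch tokens is an assumed API, and scoring patches by cosine affinity to the paired text feature is one plausible reading of "strong semantic associations":

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def attention_weighted_mask(ema_image_encoder, images, text_features, keep_ratio=0.5):
    # Score each patch token by its affinity to the paired text and keep
    # the top-k, so masking never discards semantically relevant patches.
    patch_tokens = ema_image_encoder(images)                 # (B, N, D), assumed API
    patches = F.normalize(patch_tokens, dim=-1)
    text = F.normalize(text_features, dim=-1).unsqueeze(-1)  # (B, D, 1)
    scores = torch.bmm(patches, text).squeeze(-1)            # (B, N) cosine affinity
    k = max(1, int(keep_ratio * scores.size(1)))
    keep = scores.topk(k, dim=1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(1, keep, True)                             # True = token retained
    return mask
```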
Related papers
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining [2.9010546489056415]
Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets.
In the fashion domain, datasets often exhibit a disparity between the information conveyed in images and text.
We propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text (see the sketch below).
arXiv Detail & Related papers (2024-04-01T15:01:38Z)
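A minimal sketch of how such synchronized masking might be realized, assuming per-patch and per-word features are available from the two encoders; scoring co-occurrence by pairwise feature similarity and masking the top-scoring fraction on both sides is an assumed reading of the abstract, not SyncMask's actual procedure:

```python
import torch

def syncmask_targets(patch_feats, word_feats, top_frac=0.15):
    # patch_feats: (B, N, D) image patches; word_feats: (B, M, D) word tokens.
    sim = torch.einsum("bnd,bmd->bnm", patch_feats, word_feats)  # (B, N, M)
    patch_score = sim.max(dim=2).values   # strongest word match per patch
    word_score = sim.max(dim=1).values    # strongest patch match per word
    n_patch = max(1, int(top_frac * patch_score.size(1)))
    n_word = max(1, int(top_frac * word_score.size(1)))
    patch_mask = torch.zeros_like(patch_score, dtype=torch.bool)
    patch_mask.scatter_(1, patch_score.topk(n_patch, dim=1).indices, True)
    word_mask = torch.zeros_like(word_score, dtype=torch.bool)
    word_mask.scatter_(1, word_score.topk(n_word, dim=1).indices, True)
    return patch_mask, word_mask  # positions to mask jointly in MIM/MLM
```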
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing (see the sketch below).
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
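A compact sketch of the selection step in such a pipeline, assuming mask proposals and one visual feature per masked region (e.g., from CLIP's image encoder) are already computed; the spatial rectifier post-processing is omitted and the interfaces are hypothetical:

```python
import torch
import torch.nn.functional as F

def pick_referred_mask(masks, mask_feats, text_feat):
    # masks: (K, H, W) binary proposals; mask_feats: (K, D) region features;
    # text_feat: (D,) embedding of the referring expression.
    v = F.normalize(mask_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    scores = v @ t                    # (K,) visual-text matching scores
    return masks[scores.argmax()]     # proposal best matching the expression
```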
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features (sketched below).
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
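One hedged reading of such an adaptive mask: derive a per-token weight from the cross-attention maps and rescale each text token's contribution. The mean aggregation and softmax temperature below are illustrative guesses, and conditioning on the prompt embeddings is omitted for brevity:

```python
import torch

def adaptive_token_mask(cross_attn, tau=1.0):
    # cross_attn: (B, heads, HW, T) attention from image latents to text tokens.
    token_strength = cross_attn.mean(dim=(1, 2))           # (B, T) per-token salience
    weights = torch.softmax(token_strength / tau, dim=-1)  # adaptive mask over tokens
    reweighted = cross_attn * weights[:, None, None, :]    # rescale each token's share
    return reweighted / reweighted.sum(dim=-1, keepdim=True)  # rows sum to 1 again
```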
- Block the Label and Noise: An N-Gram Masked Speller for Chinese Spell Checking [0.0]
This paper proposes an n-gram masking layer that masks current and/or surrounding tokens to avoid label leakage and error disturbance (see the sketch below).
Experiments on SIGHAN datasets have demonstrated that the pluggable n-gram masking mechanism can improve the performance of prevalent CSC models.
arXiv Detail & Related papers (2023-05-05T06:43:56Z)
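A minimal sketch of an n-gram masking layer under assumed BERT-style inputs; the window shape and `mask_id` (BERT's [MASK] is 103) are assumptions:

```python
import torch

def ngram_mask(input_ids, pos, n=2, mask_id=103):
    # When predicting position `pos`, hide the current token and its n-gram
    # neighborhood so the label cannot leak through the input and nearby
    # typos cannot disturb the prediction.
    ids = input_ids.clone()                    # input_ids: (B, L)
    lo = max(0, pos - (n - 1))
    hi = min(ids.size(1), pos + n)             # window of up to 2n-1 tokens
    ids[:, lo:hi] = mask_id
    return ids
```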
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images (sketched below).
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
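A hedged sketch of the alignment idea: embed candidate region features and caption words in a shared space and keep confident matches as pseudo labels. The threshold and interfaces are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def pseudo_label_masks(mask_feats, word_embs, words, thresh=0.3):
    # mask_feats: (K, D) features of candidate object masks;
    # word_embs: (V, D) embeddings of caption words; words: list of V strings.
    m = F.normalize(mask_feats, dim=-1)
    w = F.normalize(word_embs, dim=-1)
    sim = m @ w.t()                        # (K, V) cosine similarities
    best, idx = sim.max(dim=1)             # best-aligned word per mask
    return [(words[i], s.item()) if s.item() >= thresh else (None, s.item())
            for i, s in zip(idx.tolist(), best)]
```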
- Few-shot Semantic Image Synthesis Using StyleGAN Prior [8.528384027684192]
We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior.
Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks (see the sketch below).
Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles.
arXiv Detail & Related papers (2021-03-27T11:04:22Z)
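A minimal sketch of such a few-shot mapping: average intermediate StyleGAN features under each annotated class to form prototypes, then pseudo-label new pixels by nearest prototype. The prototype construction is an assumption; the paper's actual mapping may differ:

```python
import torch
import torch.nn.functional as F

def fit_class_prototypes(feat_maps, masks, num_classes):
    # feat_maps: (B, D, H, W) StyleGAN features; masks: (B, H, W) class ids.
    protos = torch.zeros(num_classes, feat_maps.size(1))
    for c in range(num_classes):
        sel = (masks.unsqueeze(1) == c)                     # (B, 1, H, W)
        if sel.any():
            protos[c] = (feat_maps * sel).sum(dim=(0, 2, 3)) / sel.sum()
    return protos

def pseudo_mask(feat_map, protos):
    # Label every pixel of a new sample by its nearest class prototype.
    f = F.normalize(feat_map.permute(1, 2, 0), dim=-1)      # (H, W, D)
    p = F.normalize(protos, dim=-1)                         # (C, D)
    return (f @ p.t()).argmax(dim=-1)                       # (H, W) class ids
```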