AMNS: Attention-Weighted Selective Mask and Noise Label Suppression for Text-to-Image Person Retrieval
- URL: http://arxiv.org/abs/2409.06385v3
- Date: Sat, 08 Feb 2025 03:45:33 GMT
- Authors: Runqing Zhang, Xue Zhou
- Abstract summary: A noisy correspondence (NC) issue exists due to poor image quality and labeling errors.
Random masking augmentation may inadvertently discard critical semantic content.
The Bidirectional Similarity Distribution Matching (BSDM) loss enables the model to effectively learn from positive pairs.
The Weight Adjustment Focal (WAF) loss improves the model's ability to handle hard samples.
- Abstract: Most existing text-to-image person retrieval methods usually assume that the training image-text pairs are perfectly aligned; however, the noisy correspondence (NC) issue (i.e., incorrect or unreliable alignment) exists due to poor image quality and labeling errors. Additionally, random masking augmentation may inadvertently discard critical semantic content, introducing noisy matches between images and text descriptions. To address these two challenges, we propose a noise label suppression method to mitigate NC and an Attention-Weighted Selective Mask (AWM) strategy to resolve the issues caused by random masking. Specifically, the Bidirectional Similarity Distribution Matching (BSDM) loss enables the model to learn effectively from positive pairs while preventing it from over-relying on them, thereby mitigating the risk of overfitting to noisy labels. In conjunction with this, the Weight Adjustment Focal (WAF) loss improves the model's ability to handle hard samples. Furthermore, AWM processes raw images through an EMA version of the image encoder, selectively retaining tokens with strong semantic connections to the text, enabling better feature extraction. Extensive experiments demonstrate the effectiveness of our approach in addressing noise-related issues and improving retrieval performance.
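The AWM idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it scores each image patch token by a simple dot-product similarity to the text embedding (the paper uses attention weights from an EMA image encoder) and keeps only the highest-scoring fraction of tokens instead of masking patches at random. The function name and `keep_ratio` default are illustrative assumptions.

```python
import numpy as np

def attention_weighted_selective_mask(patch_tokens, text_embedding, keep_ratio=0.7):
    """Sketch of an attention-weighted selective mask (illustrative only).

    patch_tokens: (num_patches, dim) image patch token features
    text_embedding: (dim,) pooled text feature
    """
    # Score each patch token by its semantic relevance to the text.
    scores = patch_tokens @ text_embedding               # (num_patches,)
    # Retain the top-scoring fraction of tokens rather than a random subset.
    k = max(1, int(keep_ratio * patch_tokens.shape[0]))
    keep_idx = np.sort(np.argsort(scores)[-k:])          # top-k, original order
    return patch_tokens[keep_idx]
```

Selecting tokens by relevance rather than uniformly at random is what avoids discarding the semantic content the text actually refers to.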
Related papers
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- A Robust Multisource Remote Sensing Image Matching Method Utilizing Attention and Feature Enhancement Against Noise Interference [15.591520484047914]
We propose a robust multisource remote sensing image matching method utilizing attention and feature enhancement against noise interference.
In the first stage, we combine deep convolution with the attention mechanism of transformers to perform dense feature extraction.
In the second stage, we introduce an outlier removal network based on a binary classification mechanism.
arXiv Detail & Related papers (2024-10-01T03:35:34Z)
- SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining [2.9010546489056415]
Vision-language models (VLMs) have made significant strides in cross-modal understanding through paired datasets.
In the fashion domain, datasets often exhibit a disparity between the information conveyed in image and text.
We propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where information co-occurs in both image and text.
arXiv Detail & Related papers (2024-04-01T15:01:38Z)
- Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
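The adaptive-mask idea above, weighting each text token's contribution by attention maps and prompt embeddings, can be illustrated with a small sketch. All names, the scoring rule, and the `temperature` parameter are illustrative assumptions, not the MaskDiffusion implementation:

```python
import numpy as np

def adaptive_token_mask(attn_maps, prompt_emb, temperature=1.0):
    """Illustrative adaptive mask over text tokens (hypothetical sketch).

    attn_maps: (num_positions, num_tokens) cross-attention weights
    prompt_emb: (num_tokens, dim) prompt token embeddings
    Returns a (num_tokens,) weight vector summing to 1.
    """
    token_attn = attn_maps.mean(axis=0)             # avg attention mass per token
    token_scale = np.linalg.norm(prompt_emb, axis=1)  # embedding magnitude per token
    logits = token_attn * token_scale / temperature
    weights = np.exp(logits - logits.max())          # stable softmax
    return weights / weights.sum()
```

The point is that the mask is computed from quantities already available at inference time (attention maps and prompt embeddings), which is why such a method can be training-free.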
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE).
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z)
- NLIP: Noise-robust Language-Image Pre-training [95.13287735264937]
We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way.
arXiv Detail & Related papers (2022-12-14T08:19:30Z)
- Embedding contrastive unsupervised features to cluster in- and out-of-distribution noise in corrupted image datasets [18.19216557948184]
Using search engines for web image retrieval is a tempting alternative to manual curation when creating an image dataset.
Their main drawback remains the proportion of incorrect (noisy) samples retrieved.
We propose a two-stage algorithm starting with a detection step where we use unsupervised contrastive feature learning.
We find that the alignment and uniformity principles of contrastive learning allow OOD samples to be linearly separated from ID samples on the unit hypersphere.
arXiv Detail & Related papers (2022-07-04T16:51:56Z)
- Adaptive Shrink-Mask for Text Detection [91.34459257409104]
Existing real-time text detectors reconstruct text contours directly from shrink-masks.
The dependence on predicted shrink-masks leads to unstable detection results.
Super-pixel Window (SPW) is designed to supervise the network.
arXiv Detail & Related papers (2021-11-18T07:38:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.