NLIP: Noise-robust Language-Image Pre-training
- URL: http://arxiv.org/abs/2212.07086v1
- Date: Wed, 14 Dec 2022 08:19:30 GMT
- Title: NLIP: Noise-robust Language-Image Pre-training
- Authors: Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing
Xu, Xiaodan Liang
- Abstract summary: We propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion.
Our NLIP alleviates the common noise effects during image-text pre-training more efficiently.
- Score: 95.13287735264937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale cross-modal pre-training paradigms have recently shown ubiquitous
success on a wide range of downstream tasks, e.g., zero-shot classification,
retrieval and image captioning. However, their successes highly rely on the
scale and quality of web-crawled data that naturally contain incomplete and
noisy information (e.g., wrong or irrelevant content). Existing works either
design manual rules to clean the data or generate pseudo-targets as auxiliary
signals to reduce the impact of noise, but neither explicitly tackles the
incorrect and incomplete challenges simultaneously. In this paper, to
automatically mitigate the impact of noise by solely mining over existing data,
we propose a principled Noise-robust Language-Image Pre-training framework
(NLIP) to stabilize pre-training via two schemes: noise-harmonization and
noise-completion. First, in the noise-harmonization scheme, NLIP estimates the
noise probability of each pair according to the memorization effect of
cross-modal transformers, then adopts noise-adaptive regularization to
harmonize the cross-modal alignments to varying degrees. Second, in the
noise-completion scheme, to enrich the missing object information in the text,
NLIP
injects a concept-conditioned cross-modal decoder to obtain semantic-consistent
synthetic captions to complete noisy ones, using the visual concepts (i.e.,
object names) retrieved for the corresponding image to guide caption
generation. By collaboratively optimizing the noise-harmonization and
noise-completion schemes, our NLIP alleviates the common noise effects of
image-text pre-training more efficiently. Extensive experiments show
significant performance improvements for NLIP, pre-trained on only 26M
image-text pairs, over existing pre-trained models (e.g., CLIP, FILIP and
BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and
zero-shot image-text retrieval tasks.
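To make the noise-harmonization idea concrete, below is a minimal PyTorch sketch (not NLIP's actual implementation): per-pair noise probabilities, estimated here with a simple memorization-effect heuristic (mismatched pairs tend to incur higher alignment loss before the model memorizes them), down-weight each pair's contribution to a symmetric contrastive loss. The function names, the linear weighting, and the heuristic itself are illustrative assumptions.

```python
# Sketch of noise-adaptive contrastive weighting, assuming CLIP-style
# image/text embeddings of shape (B, D). Illustrative only.
import torch
import torch.nn.functional as F

def noise_adaptive_clip_loss(img_emb, txt_emb, noise_prob, temperature=0.07):
    """Symmetric InfoNCE loss where each pair's contribution is down-weighted
    by its estimated noise probability (noise-harmonization, schematically)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    per_pair = 0.5 * (loss_i2t + loss_t2i)                # (B,) per-pair loss
    weights = 1.0 - noise_prob                            # trust clean pairs more
    return (weights * per_pair).sum() / weights.sum().clamp_min(1e-8)

def estimate_noise_prob(per_pair_loss):
    """Illustrative memorization-effect heuristic: pairs whose loss sits far
    above the batch median are more likely to be mismatched."""
    z = (per_pair_loss - per_pair_loss.median()) / (per_pair_loss.std() + 1e-8)
    return torch.sigmoid(z)
```

In practice the noise estimate would be tracked across training rather than computed from a single batch; the batch-level heuristic above only conveys the shape of the mechanism.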
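The noise-completion scheme can likewise be pictured as a gating step: captions judged clean pass through unchanged, while noisy ones are replaced by synthetic captions generated conditioned on retrieved visual concepts. In the hedged sketch below, `generate_caption` is a hypothetical stand-in for NLIP's concept-conditioned cross-modal decoder, and the threshold is an assumed hyperparameter.

```python
# Sketch of the noise-completion gating logic. The decoder is stubbed out;
# in NLIP it is a cross-modal transformer conditioned on the image and the
# names of objects retrieved for it.
from typing import List

def generate_caption(image_features, concepts: List[str]) -> str:
    # Hypothetical stand-in for the concept-conditioned decoder.
    return "a photo of " + " and ".join(concepts)

def complete_caption(image_features, raw_caption: str, concepts: List[str],
                     noise_prob: float, threshold: float = 0.5) -> str:
    """Keep a caption judged clean; replace a noisy one with a synthetic,
    concept-grounded caption (the 0.5 threshold is an assumption)."""
    if noise_prob < threshold:
        return raw_caption
    return generate_caption(image_features, concepts)

# Usage: a pair flagged as noisy gets a concept-grounded replacement.
print(complete_caption(None, "click here to buy", ["dog", "frisbee"], 0.9))
# -> "a photo of dog and frisbee"
```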
Related papers
- Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation [25.410770364140856]
Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain.
This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs).
We introduce the notion of dynamic perturbation, which can inject controlled perturbations into the noise embeddings during inference.
arXiv Detail & Related papers (2024-09-03T02:29:01Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Advancing Unsupervised Low-light Image Enhancement: Noise Estimation, Illumination Interpolation, and Self-Regulation [55.07472635587852]
Low-Light Image Enhancement (LLIE) techniques have made notable advancements in preserving image details and enhancing contrast.
These approaches encounter persistent challenges in efficiently mitigating dynamic noise and accommodating diverse low-light scenarios.
We first propose a quick and accurate method for estimating the noise level in low-light images.
We then devise a Learnable Illumination Interpolator (LII) to satisfy general constraints between illumination and input.
arXiv Detail & Related papers (2023-05-17T13:56:48Z)
- NASTAR: Noise Adaptive Speech Enhancement with Target-Conditional Resampling [34.565077865854484]
We propose noise adaptive speech enhancement with target-conditional resampling (NASTAR).
NASTAR uses a feedback mechanism to simulate adaptive training data via a noise extractor and a retrieval model.
Experimental results show that NASTAR can effectively use one noisy speech sample to adapt an SE model to a target condition.
arXiv Detail & Related papers (2022-06-18T00:15:48Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Adaptive noise imitation for image denoising [58.21456707617451]
We develop a new adaptive noise imitation (ADANI) algorithm that can synthesize noisy data from naturally noisy images.
To produce realistic noise, a noise generator takes unpaired noisy/clean images as input, where the noisy image is a guide for noise generation.
Coupling the noisy data output from ADANI with the corresponding ground-truth, a denoising CNN is then trained in a fully-supervised manner.
arXiv Detail & Related papers (2020-11-30T02:49:36Z)
- Distribution Conditional Denoising: A Flexible Discriminative Image Denoiser [0.0]
A flexible discriminative image denoiser is introduced in which multi-task learning methods are applied to a denoising FCN based on U-Net.
It has been shown that this conditional training method can generalise a fixed-noise-level U-Net denoiser to a variety of noise levels.
arXiv Detail & Related papers (2020-11-24T21:27:18Z)
- Unpaired Learning of Deep Image Denoising [80.34135728841382]
This paper presents a two-stage scheme by incorporating self-supervised learning and knowledge distillation.
For self-supervised learning, we suggest a dilated blind-spot network (D-BSN) to learn denoising solely from real noisy images.
Experiments show that our unpaired learning method performs favorably on both synthetic noisy images and real-world noisy photographs.
arXiv Detail & Related papers (2020-08-31T16:22:40Z)