CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes
- URL: http://arxiv.org/abs/2310.09761v1
- Date: Sun, 15 Oct 2023 07:20:22 GMT
- Title: CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes
- Authors: Yulei Qin, Xingyu Chen, Yunhang Shen, Chaoyou Fu, Yun Gu, Ke Li, Xing Sun, Rongrong Ji
- Abstract summary: Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
- Score: 93.71909293023663
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Webly supervised learning has attracted increasing attention for its
effectiveness in exploring publicly accessible data at scale without manual
annotation. However, most existing methods of learning with web datasets
face challenges from label noise, and they make restrictive assumptions
about which samples are clean under various types of noise. For instance,
web images retrieved with the queries tiger cat (a cat species) and
drumstick (a musical instrument) are dominated by images of tigers and of
chickens, respectively, which exacerbates the challenge of fine-grained
visual concept learning. In such cases, exploiting both web images and
their associated texts is necessary to combat
real-world noise. In this paper, we propose Cross-modality Aligned Prototypes
(CAPro), a unified prototypical contrastive learning framework to learn visual
representations with correct semantics. On the one hand, we leverage textual
prototypes, derived from the distinct concept definitions of classes, to
select clean images via text matching and thus disambiguate the formation of
visual prototypes. On the other hand, to handle missing and mismatched noisy
texts, we resort to the visual feature space to complete and enhance
individual texts, thereby improving text matching. The semantically aligned
visual prototypes are further refined with high-quality samples and engaged
in both cluster regularization and noise removal. In addition, we propose
collective bootstrapping to encourage smoother and more reliable label
references from appearance-similar instances via dictionary look-up. Extensive
experiments on WebVision1k and NUS-WIDE (Web) demonstrate that CAPro well
handles realistic noise under both single-label and multi-label scenarios.
CAPro achieves new state-of-the-art performance and exhibits robustness to
open-set recognition. Code is available at https://github.com/yuleiqin/capro.
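To make the two mechanisms in the abstract concrete, below is a minimal sketch (ours, not the authors' code, which lives in the linked repository) of (a) selecting clean images by matching per-sample text embeddings against class-level textual prototypes, and (b) collective bootstrapping as a dictionary look-up that turns the predictions of visually similar instances into soft targets. All function names, the top-ratio selection rule, and the softmax-weighted look-up are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_clean_by_text_matching(text_emb, text_prototypes, labels, top_ratio=0.5):
    """Per class, keep the samples whose text embeddings best match that
    class's textual prototype (illustrative selection rule)."""
    text_emb = F.normalize(text_emb, dim=1)        # (N, D) per-sample text embeddings
    protos = F.normalize(text_prototypes, dim=1)   # (C, D) class textual prototypes
    sims = (text_emb * protos[labels]).sum(dim=1)  # cosine sim to own-class prototype
    clean = torch.zeros_like(labels, dtype=torch.bool)
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        k = max(1, int(top_ratio * idx.numel()))
        clean[idx[sims[idx].topk(k).indices]] = True  # best-matching samples per class
    return clean

def collective_bootstrap_targets(query_feat, dict_feat, dict_probs, temperature=0.1):
    """Soft targets via dictionary look-up: each query's target is a
    similarity-weighted average of the predictions of visually similar entries."""
    q = F.normalize(query_feat, dim=1)   # (B, D) visual features of the batch
    d = F.normalize(dict_feat, dim=1)    # (M, D) features stored in the dictionary
    w = F.softmax(q @ d.t() / temperature, dim=1)  # (B, M) attention over dictionary
    return w @ dict_probs                # (B, C) bootstrapped soft labels
```

In training, the clean subset would seed the visual prototypes, and the bootstrapped soft targets would typically be interpolated with the original web labels; how they are weighted is a design choice of the actual method.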
Related papers
- Vision-Language Models are Strong Noisy Label Detectors [76.07846780815794]
This paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models.
DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels.
Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
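As a rough illustration of sieving noisy labels with pre-trained text-image alignment (not DeFT's actual criterion, which involves fine-tuning), one can flag a web label as suspect when an off-the-shelf model such as CLIP prefers a different class name; the prompt template below is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def label_looks_noisy(image: Image.Image, class_names: list[str], given_label: int) -> bool:
    """Flag a sample when image-text alignment prefers another class name."""
    inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, C) image-text similarity
    return int(logits.argmax(dim=1)) != given_label
```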
arXiv Detail & Related papers (2024-09-29T12:55:17Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary semantics, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Text as Image: Learning Transferable Adapter for Multi-Label Classification [13.11583340598517]
We introduce an effective approach to employ large language models for multi-label instruction-following text generation.
In this way, a fully automated pipeline for visual label recognition is developed without relying on any manual data.
arXiv Detail & Related papers (2023-12-07T09:22:20Z)
- Brief Introduction to Contrastive Learning Pretext Tasks for Visual Representation [0.0]
We introduce contrastive learning, a subset of unsupervised learning methods.
The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing apart views of different samples (see the sketch below).
We survey recently published contrastive learning strategies that focus on pretext tasks for visual representation.
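The objective just described is commonly instantiated as the InfoNCE / NT-Xent loss; a minimal sketch (our naming), in which the two augmented views of each image are positives and all other views in the batch serve as negatives:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over a batch of paired views z1[i] <-> z2[i]."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, D) joint view batch
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # a view is not its own positive
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)                # pull positives, push the rest
```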
arXiv Detail & Related papers (2022-10-06T18:54:10Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
arXiv Detail & Related papers (2022-06-25T05:29:48Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
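The dual-encoder alignment described above is typically trained with a symmetric image-to-text and text-to-image contrastive loss over in-batch pairs; a minimal sketch (names and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def dual_encoder_contrastive(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: matched (image, text) pairs sit on the
    diagonal; every other pairing in the batch serves as a negative."""
    img = F.normalize(img_emb, dim=1)        # (B, D) image-encoder outputs
    txt = F.normalize(txt_emb, dim=1)        # (B, D) text-encoder outputs
    logits = img @ txt.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(img.size(0))      # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```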
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Attention-Aware Noisy Label Learning for Image Classification [97.26664962498887]
Deep convolutional neural networks (CNNs) learned on large-scale labeled samples have achieved remarkable progress in computer vision.
The cheapest way to obtain a large body of labeled visual data is to crawl from websites with user-supplied labels, such as Flickr.
This paper proposes the attention-aware noisy label learning approach to improve the discriminative capability of the network trained on datasets with potential label noise.
arXiv Detail & Related papers (2020-09-30T15:45:36Z)