Face Recognition in the age of CLIP & Billion image datasets
- URL: http://arxiv.org/abs/2301.07315v1
- Date: Wed, 18 Jan 2023 05:34:57 GMT
- Title: Face Recognition in the age of CLIP & Billion image datasets
- Authors: Aaditya Bhat, Shrey Jain
- Abstract summary: We evaluate the performance of various CLIP models as zero-shot face recognizers.
We also investigate the robustness of CLIP models against data poisoning attacks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: CLIP (Contrastive Language-Image Pre-training) models developed by OpenAI
have achieved outstanding results on various image recognition and retrieval
tasks, displaying strong zero-shot performance. This means that they are able
to perform effectively on tasks for which they have not been explicitly
trained. Inspired by the success of OpenAI's CLIP, a new publicly available
dataset called LAION-5B was collected, which led to the development of the open
ViT-H/14 and ViT-G/14 models that outperform the OpenAI ViT-L/14 model. The
LAION-5B release also includes an approximate nearest-neighbor index, with a web
interface for search & subset creation.
In this paper, we evaluate the performance of various CLIP models as
zero-shot face recognizers. Our findings show that CLIP models perform well on
face recognition tasks, but increasing the size of the CLIP model does not
necessarily lead to improved accuracy. Additionally, we investigate the
robustness of CLIP models against data poisoning attacks by testing their
performance on poisoned data. Through this analysis, we aim to understand the
potential consequences and misuse of search engines built using CLIP models,
which could inadvertently function as face recognition engines.
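The abstract does not spell out the evaluation protocol, but zero-shot face identification with CLIP can be illustrated by a minimal sketch: embed a probe face and one reference image per identity with a CLIP image encoder, then match by cosine similarity. The model tag, file names, and one-reference-per-identity gallery below are illustrative assumptions (using the open_clip library), not the authors' exact setup.

```python
# Minimal sketch of zero-shot face identification with a CLIP image encoder.
# Illustrative assumptions: open_clip ViT-H/14 (LAION-2B weights), hypothetical
# image files, and a one-reference-image-per-identity gallery.
import torch
from PIL import Image
import open_clip  # pip install open_clip_torch

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
model.eval()

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized CLIP embedding for one image."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)

# Hypothetical gallery: one reference face per identity.
gallery = {"alice": embed("alice_ref.jpg"), "bob": embed("bob_ref.jpg")}

probe = embed("probe.jpg")
scores = {name: (probe @ ref.T).item() for name, ref in gallery.items()}
print(max(scores, key=scores.get), scores)
```

A poisoning experiment in the same spirit would corrupt or mislabel a fraction of the gallery references and measure how identification accuracy degrades.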
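The approximate nearest-neighbor index mentioned in the abstract can likewise be illustrated with a small sketch that indexes precomputed CLIP image embeddings with faiss. The embedding file, dimension, and index parameters are illustrative assumptions; this is not the actual LAION-5B index tooling.

```python
# Minimal sketch of an approximate nearest-neighbor index over CLIP image
# embeddings, in the spirit of the LAION-5B index. The embedding file,
# dimension, and index parameters are illustrative assumptions.
import numpy as np
import faiss  # pip install faiss-cpu

d = 768  # e.g., CLIP ViT-L/14 embedding size; adjust for other backbones
embeddings = np.load("clip_image_embeddings.npy").astype("float32")  # hypothetical file
faiss.normalize_L2(embeddings)  # unit norm, so inner product == cosine similarity

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)  # learn the 1024 coarse clusters from the data
index.add(embeddings)    # add every embedding to the index

index.nprobe = 16  # clusters visited per query (speed/recall trade-off)
scores, ids = index.search(embeddings[:1], 10)  # top-10 approximate neighbors
print(ids[0], scores[0])
```

A web search interface of the kind described above would map a text or image query to a CLIP embedding and run the same top-k lookup against the index.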
Related papers
- Enabling Small Models for Zero-Shot Classification through Model Label Learning [50.68074833512999]
We introduce a novel paradigm, Model Label Learning (MLL), which bridges the gap between models and their functionalities.
Experiments on seven real-world datasets validate the effectiveness and efficiency of MLL.
arXiv Detail & Related papers (2024-08-21T09:08:26Z)
- Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP [3.5999252362400993]
We study whether vision-language models can successfully classify images with novel compositions of attribute-object pairs.
We find that CLIP models trained on large datasets such as OpenAI CLIP, LAION-400M, and LAION-2B show orders-of-magnitude improvements in effective compositional OoD generalization.
Our results provide evidence that the scale and diversity of training data and language supervision play a key role in unlocking the compositional generalization abilities of vision-language models.
arXiv Detail & Related papers (2024-03-27T12:59:44Z)
- A Sober Look at the Robustness of CLIPs to Spurious Features [45.87070442259975]
We create a new dataset named CounterAnimal to reveal the reliance of CLIP models on realistic spurious features.
Our evaluations show that the spurious features captured by CounterAnimal are learned generically by CLIP models with different backbones and pre-training data, yet have limited influence on ImageNet models.
arXiv Detail & Related papers (2024-03-18T06:04:02Z)
- Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z)
- Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable-Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
arXiv Detail & Related papers (2023-07-18T13:10:11Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection [1.597617022056624]
We present a comprehensive experimental study on pretrained feature extractors for visual out-of-distribution (OOD) detection.
We propose a new, simple, and scalable method called pseudo-label probing (PLP) that adapts vision-language models for OOD detection.
arXiv Detail & Related papers (2023-03-10T10:02:18Z)
- Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by training only new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings.
arXiv Detail & Related papers (2023-01-17T18:59:06Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)