Training CLIP models on Data from Scientific Papers
- URL: http://arxiv.org/abs/2311.04711v1
- Date: Wed, 8 Nov 2023 14:38:10 GMT
- Title: Training CLIP models on Data from Scientific Papers
- Authors: Calvin Metzger
- Abstract summary: Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship between images and text.
This paper explores whether limited amounts of higher-quality data from a specific domain improve the general performance of CLIP models.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship between images and text and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts of higher-quality data from a specific domain improve the general performance of CLIP models. To this purpose, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwhile research direction.
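The paper itself ships no code, but the objective such CLIP models optimize is the standard symmetric contrastive (InfoNCE) loss over batches of image-text pairs — here, e.g., figure-caption pairs extracted from arXiv and PubMed Central. A minimal PyTorch sketch; the function name, batch size, and temperature are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot products below are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature
    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy call with random vectors standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```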
Related papers
- DataDream: Few-shot Guided Dataset Generation [90.09164461462365]
We propose a framework for synthesizing classification datasets that more faithfully represent the real data distribution.
DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model.
We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets.
arXiv Detail & Related papers (2024-07-15T17:10:31Z)
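DataDream's exact training recipe is not reproduced here, but the LoRA mechanism it fine-tunes is standard: the pretrained weights stay frozen and only a low-rank correction is trained. A minimal sketch, with rank and scaling chosen arbitrarily:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W·x + (α/r)·B·A·x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # pretrained weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Only lora_a and lora_b receive gradients during fine-tuning.
        return self.base(x) + (x @ self.lora_a.t() @ self.lora_b.t()) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))
```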
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using counterfactual images generated under language guidance.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
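The generation side of this framework requires a text-to-image model, but the weakness-identification step it describes amounts to collecting the counterfactual images the classifier gets wrong. A sketch assuming a generic PyTorch classifier and dataset (all names are placeholders):

```python
import torch
from torch.utils.data import DataLoader, Subset

@torch.no_grad()
def find_failures(model, counterfactual_set, batch_size=64):
    """Indices of counterfactual images the classifier gets wrong."""
    model.eval()
    failures = []
    for i, (images, labels) in enumerate(DataLoader(counterfactual_set,
                                                    batch_size=batch_size)):
        preds = model(images).argmax(dim=-1)
        wrong = (preds != labels).nonzero(as_tuple=True)[0] + i * batch_size
        failures.extend(wrong.tolist())
    return failures

# The failure cases then form the augmented fine-tuning set:
# hard_set = Subset(counterfactual_set, find_failures(model, counterfactual_set))
```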
- Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP [3.5999252362400993]
We study whether vision-language models can successfully classify images with novel compositions of attribute-object pairs.
We found that CLIP models trained on large datasets, such as those behind OpenAI CLIP, LAION-400M, and LAION-2B, show an orders-of-magnitude improvement in effective compositional out-of-distribution (OoD) generalization.
Our results provide evidence that the scale and diversity of training data and language supervision play a key role in unlocking the compositional generalization abilities of vision-language models.
arXiv Detail & Related papers (2024-03-27T12:59:44Z)
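As context for the compositional evaluation above, this is roughly how attribute-object compositions are scored zero-shot with an open-source CLIP model; the attribute-object grid below is invented, not the paper's benchmark:

```python
import torch
import open_clip  # assumes the open_clip_torch package is installed

# Invented attribute-object grid for illustration only.
attributes = ["red", "spotted", "wooden"]
objects = ["car", "dog", "chair"]
prompts = [f"a photo of a {a} {o}" for a in attributes for o in objects]

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features /= text_features.norm(dim=-1, keepdim=True)
# For a preprocessed image batch, the zero-shot prediction is the argmax of
# image_features @ text_features.t() over the composition prompts.
```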
- Scaling Laws of Synthetic Images for Model Training ... for Now [54.43596959598466]
We study the scaling laws of synthetic images generated by state-of-the-art text-to-image models.
We observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training.
arXiv Detail & Related papers (2023-12-07T18:59:59Z)
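"Scaling laws" in this context usually means fitting a power law, error ≈ a · N^(-b), to performance as a function of training-set size N, which is a linear fit in log-log space. A toy illustration with made-up numbers:

```python
import numpy as np

# Made-up dataset sizes and error rates, for illustration only.
n = np.array([1e5, 1e6, 1e7, 1e8])
err = np.array([0.62, 0.48, 0.37, 0.29])

# Fit log(err) = log(a) + slope * log(n), i.e. err ≈ a * n**slope.
slope, log_a = np.polyfit(np.log(n), np.log(err), 1)
print(f"err ≈ {np.exp(log_a):.3f} * N^({slope:.3f})")  # slope = -b
```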
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
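One plausible minimal form of such a baseline: embed training and test images with a frozen self-supervised backbone and attribute each test prediction to the most similar training examples under cosine similarity. Random tensors stand in for backbone features below; k is arbitrary:

```python
import torch
import torch.nn.functional as F

def attribute(test_feats, train_feats, k=5):
    """Top-k most similar training examples per test example (cosine similarity)."""
    test_feats = F.normalize(test_feats, dim=-1)
    train_feats = F.normalize(train_feats, dim=-1)
    sims = test_feats @ train_feats.t()        # (n_test, n_train)
    scores, indices = sims.topk(k, dim=-1)     # candidate "influential" examples
    return scores, indices

# Random vectors stand in for features from a self-supervised backbone.
scores, indices = attribute(torch.randn(4, 768), torch.randn(1000, 768))
```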
- An evaluation of pre-trained models for feature extraction in image classification [0.0]
This work aims to compare the performance of different pre-trained neural networks for feature extraction in image classification tasks.
Our results demonstrate that the best overall performance across the datasets was achieved by CLIP-ViT-B and ViT-H-14, while CLIP-ResNet50 achieved similar performance with less variability.
arXiv Detail & Related papers (2023-10-03T13:28:14Z)
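The usual protocol behind comparisons like this one: freeze a pretrained backbone, strip its classification head, extract features once, and fit a light classifier on top. A sketch with torchvision's ResNet-50 standing in for any of the evaluated backbones:

```python
import torch
from torchvision import models

# Frozen pretrained backbone with the classification head removed.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: a preprocessed batch of shape (N, 3, 224, 224)."""
    return backbone(images)

feats = extract_features(torch.randn(2, 3, 224, 224))  # -> shape (2, 2048)
# A linear probe (e.g. scikit-learn's LogisticRegression) is then fit on `feats`.
```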
- T-ADAF: Adaptive Data Augmentation Framework for Image Classification Network based on Tensor T-product Operator [0.0]
This paper proposes an Adaptive Data Augmentation Framework based on the tensor T-product Operator.
It triples each training image and aggregates the result from all three versions, with less than a 0.1% increase in the number of parameters.
Numerical experiments show that our data augmentation framework can improve the performance of the original neural network model by 2%.
arXiv Detail & Related papers (2023-06-07T08:30:44Z)
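T-ADAF's full framework is not reproduced here, but the tensor t-product it builds on is a standard operation: an FFT along the tube (third) axis, a matrix multiplication per frontal slice, and an inverse FFT. A minimal NumPy sketch:

```python
import numpy as np

def t_product(A, B):
    """t-product of A (n1 x n2 x n3) and B (n2 x l x n3): FFT along the tube
    axis, a matrix product per frontal slice, then an inverse FFT."""
    Af = np.fft.fft(A, axis=-1)
    Bf = np.fft.fft(B, axis=-1)
    Cf = np.einsum("ijk,jlk->ilk", Af, Bf)   # slice-wise matmul in Fourier space
    return np.real(np.fft.ifft(Cf, axis=-1))

A = np.random.rand(8, 8, 3)   # e.g. a small RGB-like tensor
B = np.random.rand(8, 8, 3)
C = t_product(A, B)           # shape (8, 8, 3)
```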
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
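Text-to-image retrieval benchmarks of this kind are typically scored with Recall@K: the fraction of queries whose ground-truth match ranks among the top K candidates. A self-contained sketch with random features in place of real report and X-ray embeddings:

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_feats, image_feats, k=5):
    """Fraction of texts whose paired image (same row index) ranks in the top k."""
    text_feats = F.normalize(text_feats, dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    ranks = (text_feats @ image_feats.t()).argsort(dim=-1, descending=True)
    targets = torch.arange(len(text_feats)).unsqueeze(1)
    return (ranks[:, :k] == targets).any(dim=-1).float().mean().item()

# Random vectors stand in for paired report and X-ray embeddings.
print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=5))
```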
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It adopts a retrieval-then-optimization procedure to synthesize pseudo text features.
It benefits a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
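The details of Lafite2's retrieval-then-optimization step are not given above; one reading, offered purely as an assumption, is to retrieve the nearest candidate text features for an image in a shared embedding space and then refine their average toward the image feature. All names, the text bank, and the optimization settings below are hypothetical:

```python
import torch
import torch.nn.functional as F

def pseudo_text_feature(image_feat, text_bank, k=8, steps=50, lr=0.1):
    """Hypothetical reading: average the k nearest bank entries, then nudge the
    result toward the image feature by gradient descent."""
    image_feat = F.normalize(image_feat, dim=0)
    bank = F.normalize(text_bank, dim=-1)
    nearest = bank[(bank @ image_feat).topk(k).indices]   # retrieval step
    pseudo = nearest.mean(dim=0).clone().requires_grad_(True)
    opt = torch.optim.Adam([pseudo], lr=lr)
    for _ in range(steps):                                # optimization step
        opt.zero_grad()
        loss = 1 - F.cosine_similarity(pseudo, image_feat, dim=0)
        loss.backward()
        opt.step()
    return F.normalize(pseudo.detach(), dim=0)

feat = pseudo_text_feature(torch.randn(512), torch.randn(10000, 512))
```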