Leveraging Image-Text Similarity and Caption Modification for the
DataComp Challenge: Filtering Track and BYOD Track
- URL: http://arxiv.org/abs/2310.14581v1
- Date: Mon, 23 Oct 2023 05:40:43 GMT
- Authors: Shuhei Yokoo, Peifei Zhu, Yuchi Ishikawa, Mikihiro Tanaka, Masayoshi
Kondo, Hirokatsu Kataoka
- Abstract summary: This paper presents our solution to both filtering track and BYOD track of the DataComp challenge.
Our solution adopts the large multimodal models CLIP and BLIP-2 to filter and modify web crawl data, and utilizes external datasets along with a bag of tricks to improve data quality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large web crawl datasets have already played an important role in
learning multimodal features with high generalization capabilities. However,
there are still very few studies investigating the details or improvements of
data design. Recently, the DataComp challenge was introduced to identify the
best training data for a fixed model and training recipe. This paper presents
our solution to both the filtering track and the BYOD track of the DataComp
challenge. Our solution adopts the large multimodal models CLIP and BLIP-2 to
filter and modify web crawl data, and utilizes external datasets along with a
bag of tricks to improve data quality. Experiments show that our solution
significantly outperforms the DataComp baselines (filtering track: 6.6%
improvement; BYOD track: 48.5% improvement).
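
To make the pipeline concrete, here is a minimal sketch (not the authors'
released code) of the two operations the abstract describes: scoring
image-text alignment with CLIP and re-captioning poorly aligned pairs with
BLIP-2. It assumes the HuggingFace checkpoints openai/clip-vit-base-patch32
and Salesforce/blip2-opt-2.7b; the 0.28 similarity threshold and the
keep-or-recaption policy are illustrative assumptions, not the paper's tuned
settings.

```python
from PIL import Image
import torch
from transformers import (CLIPModel, CLIPProcessor,
                          Blip2ForConditionalGeneration, Blip2Processor)

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP scores how well a caption matches its image.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# BLIP-2 generates a replacement caption when the original is too noisy.
blip2_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to(device).eval()

@torch.no_grad()
def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

@torch.no_grad()
def blip2_caption(image: Image.Image) -> str:
    """Generate a fresh caption for the image with BLIP-2."""
    inputs = blip2_proc(images=image, return_tensors="pt").to(device, torch.float16)
    out = blip2.generate(**inputs, max_new_tokens=30)
    return blip2_proc.batch_decode(out, skip_special_tokens=True)[0].strip()

SIM_THRESHOLD = 0.28  # illustrative cutoff, not the paper's tuned value

def curate(image: Image.Image, caption: str):
    """Keep well-aligned pairs; re-caption poorly aligned ones."""
    if clip_similarity(image, caption) >= SIM_THRESHOLD:
        return image, caption
    return image, blip2_caption(image)
```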
Related papers
- Rethinking Data Selection at Scale: Random Selection is Almost All You Need (arXiv, 2024-10-12)
Supervised fine-tuning is crucial for aligning Large Language Models with human instructions.
Most existing data selection techniques are designed for small-scale data pools.
- Dataset Growth (arXiv, 2024-05-28)
InfoGrowth is an efficient online algorithm for data cleaning and selection.
It can improve data quality/efficiency on both single-modal and multi-modal tasks.
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic (arXiv, 2024-04-10)
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
- LESS: Selecting Influential Data for Targeted Instruction Tuning (arXiv, 2024-02-06)
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
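
The LESS entry above centers on one computable quantity: the similarity
between a training example's gradient and the gradient of a small target set.
Below is a heavily simplified sketch of that idea in PyTorch; the actual
method works on Adam-preconditioned LoRA gradients with memory-efficient
random projections, whereas this toy version materializes a dense projection
matrix and is only feasible for small models. `loss_fn` is an assumed
user-supplied callable.

```python
import torch
import torch.nn.functional as F

def grad_feature(model, loss_fn, example, proj):
    """Project one example's flattened gradient to a low-dimensional feature.
    Toy version: LESS instead projects Adam-preconditioned LoRA gradients."""
    model.zero_grad()
    loss_fn(model, example).backward()
    g = torch.cat([p.grad.reshape(-1) for p in model.parameters()
                   if p.grad is not None])
    return proj @ g  # (d_low, d_full) @ (d_full,) -> (d_low,)

def less_style_select(model, loss_fn, train_set, target_set,
                      d_low=4096, frac=0.05):
    d_full = sum(p.numel() for p in model.parameters())
    # Fixed random projection; dense, so small models only.
    proj = torch.randn(d_low, d_full) / d_low ** 0.5

    # Average (normalized) gradient feature of the target task.
    tgt = torch.stack([grad_feature(model, loss_fn, e, proj)
                       for e in target_set]).mean(0)
    tgt = F.normalize(tgt, dim=0)

    # Score each training example by cosine similarity to the target gradient.
    scores = [torch.dot(F.normalize(grad_feature(model, loss_fn, e, proj), dim=0),
                        tgt).item()
              for e in train_set]

    k = max(1, int(frac * len(scores)))  # e.g. the 5% subset the paper trains on
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
```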
- DsDm: Model-Aware Dataset Selection with Datamodels (arXiv, 2024-01-23)
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
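
As a gloss on "models explicitly how the learning process uses train
datapoints": a datamodel in its simplest linear form regresses target-task
loss on subset membership. The sketch below is an illustrative reduction, not
the DsDm implementation; producing the (masks, target_losses) pairs requires
many training runs, which is exactly what DsDm approximates efficiently.

```python
import numpy as np

def fit_linear_datamodel(masks, target_losses, l2=1e-2):
    """masks: (m, n) 0/1 matrix; row i marks which of n train points were in
    subset i. target_losses: (m,) target-task loss after training on subset i.
    Ridge regression gives theta[j], point j's estimated effect on target loss."""
    X = masks.astype(float)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ target_losses)

def select_helpful(theta, k):
    # Most negative weights = points predicted to reduce target loss the most.
    return np.argsort(theta)[:k]
```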
- RINAS: Training with Dataset Shuffling Can Be General and Fast (arXiv, 2023-12-04)
RINAS is a data loading framework that addresses the performance bottleneck of loading globally shuffled datasets.
We implement RINAS under the PyTorch framework for the common dataset libraries HuggingFace Datasets and TorchVision.
Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning (arXiv, 2023-10-11)
Coreset selection seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as the coreset.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
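
To unpack "forward and reverse message passing over this dataset graph", here
is a hedged NumPy sketch of the D2 Pruning idea: a kNN graph over example
embeddings, a forward pass that spreads difficulty scores to neighbors, and a
greedy reverse pass that penalizes neighbors of selected points to preserve
diversity. The graph construction, update rules, and hyperparameters here are
illustrative simplifications, not the paper's exact formulation.

```python
import numpy as np

def d2_style_select(emb, difficulty, keep, k=5, gamma=0.5):
    """emb: (n, d) example embeddings; difficulty: (n,) per-example scores
    (e.g. forgetting counts); keep: coreset size. Returns selected indices."""
    n = emb.shape[0]
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # k nearest neighbors by cosine sim

    # Forward pass: each node absorbs difficulty mass from its neighbors,
    # so dense regions of hard examples get amplified scores.
    score = difficulty.astype(float)
    for i in range(n):
        score[i] += gamma * sum(sim[i, j] * difficulty[j] for j in nbrs[i])

    # Reverse pass: greedy selection; each pick down-weights its neighbors
    # so the coreset does not collapse onto one dense cluster.
    selected = []
    for _ in range(keep):
        i = int(np.argmax(score))
        picked = score[i]
        selected.append(i)
        score[i] = -np.inf
        for j in nbrs[i]:
            if np.isfinite(score[j]):
                score[j] -= gamma * sim[i, j] * picked
    return selected
```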
- Data Filtering Networks (arXiv, 2023-09-29)
We study the problem of learning a data filtering network (DFN) for the second step of dataset curation: filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
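
Operationally, a data filtering network reduces to scoring every candidate
pair with a model trained for filtering quality and keeping a top fraction of
the pool. A minimal sketch follows, assuming the per-pair scores have already
been computed by such a CLIP-style filtering model; the 20% keep rate is an
illustrative choice, not the paper's.

```python
import numpy as np

def dfn_style_filter(scores: np.ndarray, keep_frac: float = 0.2) -> np.ndarray:
    """scores: (n,) image-text scores from the filtering network.
    Returns indices of the top keep_frac of the pool, the induced dataset."""
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(-scores)[:k]
```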
- The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering (arXiv, 2023-09-27)
This paper describes our learning and solution when participating in the DataComp challenge.
Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment.
Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
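
The three stages named above compose naturally as a predicate pipeline. The
sketch below illustrates that structure only; all thresholds and the choice
of per-stage features (caption length, image resolution, CLIP score,
similarity to a reference distribution) are assumptions, not the authors'
rules.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    caption: str
    width: int
    height: int
    clip_score: float      # precomputed image-text cosine similarity
    ref_similarity: float  # precomputed similarity to a target reference set

def single_modality_ok(s: Sample) -> bool:
    # Stage 1: cheap per-modality heuristics on text and image alone.
    return len(s.caption.split()) >= 2 and min(s.width, s.height) >= 200

def cross_modality_ok(s: Sample, thr: float = 0.28) -> bool:
    # Stage 2: keep pairs whose image and caption agree under CLIP.
    return s.clip_score >= thr

def distribution_ok(s: Sample, thr: float = 0.3) -> bool:
    # Stage 3: align the kept data with the target distribution, e.g. by
    # similarity to ImageNet-like reference examples.
    return s.ref_similarity >= thr

def three_stage_filter(pool):
    return [s for s in pool if single_modality_ok(s)
            and cross_modality_ok(s) and distribution_ok(s)]
```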
This list is automatically generated from the titles and abstracts of the papers on this site.