On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets
- URL: http://arxiv.org/abs/2310.06594v2
- Date: Sat, 30 Dec 2023 02:19:22 GMT
- Title: On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets
- Authors: Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan
- Abstract summary: Instead of evaluating models directly, we evaluate Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher Sample Quality (SQ) from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply merging all VLIT datasets.
- Score: 71.54954966652286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an emerging line of research on multimodal instruction
tuning, and a number of benchmarks have recently been proposed for evaluating
these models. Instead of evaluating the models directly, in this paper we
evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves. We
also seek a way of building a dataset for developing an all-powerful VLIT
model, which we believe could additionally serve as a grounded protocol for
benchmarking VLIT models. Since effective evaluation of VLIT datasets remains
an open question, we propose a tune-cross-evaluation paradigm: tuning on one
dataset and evaluating on each of the others in turn. For each tune-evaluation
pair, we define the Meta Quality (MQ) as the mean score over a set of caption
metrics, including BLEU, METEOR, and ROUGE-L, to quantify the quality of a
dataset or an individual sample. On this basis, to evaluate the
comprehensiveness of a dataset, we define the Dataset Quality (DQ), which
aggregates MQ over all tune-evaluation pairs. To lay the foundation for
building a comprehensive dataset and developing an all-powerful model for
practical applications, we define the Sample Quality (SQ) to quantify the
all-sided quality of each sample. Extensive experiments validate the soundness
of the proposed evaluation paradigm. Based on this holistic evaluation, we
build a new dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing),
by collecting the samples with higher SQ from each dataset. Remarkably, even
with only half of the complete data, the model trained on REVO-LION achieves
performance comparable to that obtained by simply merging all VLIT datasets.
Furthermore, REVO-LION not only facilitates the development of a powerful
model but also includes an evaluation set designed to serve as a convenient
benchmark for future research in the field.
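The tune-cross-evaluation paradigm lends itself to a compact computational sketch. The Python snippet below is purely illustrative and is not the authors' released code: the function names, the sample schema (`id`, `question`, `answer`), the `tune_and_predict` and `metric_fns` hooks, and the simple averaging used for DQ and SQ are assumptions made for clarity; the paper's exact aggregation may differ.

```python
# Illustrative sketch of MQ / DQ / SQ and REVO-LION-style selection.
# All names and the averaging scheme are assumptions, not the paper's code.
from statistics import mean

def meta_quality(predictions, references, metric_fns):
    """MQ: mean caption-metric score (e.g. BLEU, METEOR, ROUGE-L) of
    predictions against references for one tune-evaluation pair."""
    return mean(
        mean(fn(p, r) for p, r in zip(predictions, references))
        for fn in metric_fns
    )

def dataset_quality(tuned_dataset, other_datasets, tune_and_predict, metric_fns):
    """DQ: aggregate MQ of a model tuned on `tuned_dataset`, evaluated on
    each of the other datasets in turn."""
    scores = []
    for eval_set in other_datasets:
        preds = tune_and_predict(tuned_dataset, eval_set)  # tune, then infer
        refs = [s["answer"] for s in eval_set]
        scores.append(meta_quality(preds, refs, metric_fns))
    return mean(scores)

def sample_quality(sample, models_tuned_elsewhere, metric_fns):
    """SQ: mean MQ of a single sample as scored by models tuned on the
    *other* datasets, i.e. its all-sided quality."""
    return mean(
        meta_quality([m(sample["question"])], [sample["answer"]], metric_fns)
        for m in models_tuned_elsewhere
    )

def build_revo_lion(datasets, sq_scores, keep_ratio=0.5):
    """Collect the higher-SQ samples from each source dataset (half by default)."""
    refined = []
    for name, samples in datasets.items():
        ranked = sorted(samples, key=lambda s: sq_scores[name][s["id"]], reverse=True)
        refined.extend(ranked[: int(len(ranked) * keep_ratio)])
    return refined
```

In this sketch, `metric_fns` would be concrete BLEU, METEOR, and ROUGE-L scorers, and `build_revo_lion` mirrors the reported setting in which keeping the higher-SQ half of every source dataset matches the performance of merging all datasets.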
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs from them.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- On Evaluation of Vision Datasets and Models using Human Competency Frameworks [20.802372291783488]
Item Response Theory (IRT) is a framework that infers interpretable latent parameters for an ensemble of models and each dataset item.
We assess model calibration, select informative data subsets, and demonstrate the usefulness of its latent parameters for analyzing and comparing models and datasets in computer vision.
arXiv Detail & Related papers (2024-09-06T06:20:11Z)
- PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation [2.1184929769291294]
This paper presents a novel synthetic dataset designed to evaluate the proficiency of large language models in interpreting data visualizations.
Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios.
We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models.
arXiv Detail & Related papers (2024-09-04T11:19:17Z)
- Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models.
Our approach yields model rankings statistically aligned with those obtained on the full datasets.
We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label-smoothing value during training according to the uncertainty of individual samples (a minimal sketch of this idea appears after this list).
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models [28.44922164328789]
Evaluation of text-to-image generative models is an essential step in the development process.
We propose FlashEval, an iterative search algorithm tailored to evaluation data selection.
Our searched 50-item subset achieves evaluation quality comparable to a randomly sampled 500-item subset of COCO annotations.
arXiv Detail & Related papers (2024-03-25T02:53:32Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
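As referenced in the Uncertainty Aware Learning entry above, the following is a minimal, hypothetical sketch of uncertainty-dependent label smoothing. The linear mapping from uncertainty to the smoothing value and the `min_eps`/`max_eps` bounds are assumptions for illustration, not the UAL authors' implementation.

```python
# Illustrative sketch (hypothetical, not the UAL authors' code): samples the
# model is uncertain about receive a larger label-smoothing value, spreading
# more probability mass over the non-target classes.
def smoothed_targets(true_class, num_classes, uncertainty,
                     min_eps=0.0, max_eps=0.2):
    """Build a soft target distribution whose smoothing strength grows
    linearly with the sample's uncertainty in [0, 1]."""
    eps = min_eps + (max_eps - min_eps) * max(0.0, min(1.0, uncertainty))
    off_value = eps / num_classes
    targets = [off_value] * num_classes
    targets[true_class] = 1.0 - eps + off_value
    return targets

# Example: a confident sample keeps a nearly one-hot target,
# while an uncertain one is smoothed more aggressively.
print(smoothed_targets(true_class=2, num_classes=4, uncertainty=0.1))
print(smoothed_targets(true_class=2, num_classes=4, uncertainty=0.9))
```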