On the Evaluation and Refinement of Vision-Language Instruction Tuning
Datasets
- URL: http://arxiv.org/abs/2310.06594v2
- Date: Sat, 30 Dec 2023 02:19:22 GMT
- Title: On the Evaluation and Refinement of Vision-Language Instruction Tuning
Datasets
- Authors: Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan
- Abstract summary: We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all VLIT datasets.
- Score: 71.54954966652286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an emerging line of research on multimodal instruction
tuning, and a line of benchmarks has recently been proposed for evaluating
these models. Instead of evaluating the models directly, in this paper we
evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves. We
also seek a way of building a dataset for developing an all-powerful VLIT
model, which we believe could also serve as a grounded protocol for
benchmarking VLIT models. Since effective evaluation of VLIT datasets remains
an open question, we propose a tune-cross-evaluation paradigm: tuning on one
dataset and evaluating on the others in turn. For each single tune-evaluation
experiment set, we define the Meta Quality (MQ) as the mean score obtained by a
set of caption metrics, including BLEU, METEOR, and ROUGE-L, to quantify the
quality of a dataset or a sample. On this basis, to evaluate the
comprehensiveness of a dataset, we develop the Dataset Quality (DQ), which
covers all tune-evaluation sets. To lay the foundation for building a
comprehensive dataset and developing an all-powerful model for practical
applications, we define the Sample Quality (SQ) to quantify the all-sided
quality of each sample. Extensive experiments validate the rationality of the
proposed evaluation paradigm. Based on this holistic evaluation, we build a new
dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing), by collecting
samples with higher SQ from each dataset. Remarkably, even with only half of
the complete data, the model trained on REVO-LION achieves performance
comparable to simply combining all VLIT datasets. Furthermore, REVO-LION not
only facilitates the development of a powerful model but also includes an
evaluation set designed to serve as a convenient benchmark for future research
in the field.
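
The MQ defined above is simply the mean of standard caption metrics computed for one tune-evaluation pair; DQ and SQ then aggregate MQ across pairs. As an illustration only (not the authors' code), here is a minimal Python sketch assuming whitespace tokenization and deliberately simplified single-n-gram variants of the metrics (real BLEU combines n-grams up to length 4, and real METEOR also matches stems and synonyms):

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified BLEU: unigram precision with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    bp = math.exp(1 - len(ref) / len(cand)) if len(cand) < len(ref) else 1.0
    return bp * precision

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure based on the longest common subsequence."""
    a, b = candidate.split(), reference.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

def meteor_like(candidate: str, reference: str) -> float:
    """METEOR-style recall-weighted harmonic mean over exact unigram matches."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    m = sum((Counter(cand) & Counter(ref)).values())
    if m == 0:
        return 0.0
    p, r = m / len(cand), m / len(ref)
    return 10 * p * r / (r + 9 * p)  # METEOR weights recall 9:1 over precision

def meta_quality(candidate: str, reference: str) -> float:
    """MQ: the mean of the caption metrics, per the paper's definition."""
    return (bleu1(candidate, reference)
            + meteor_like(candidate, reference)
            + rouge_l(candidate, reference)) / 3.0
```

Under this sketch, DQ for a dataset would be an aggregate of MQ over all tune-evaluation sets in which it participates, and SQ the analogous per-sample aggregate.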
Related papers
- Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models.
Our approach ensures statistically aligned model rankings compared to full datasets.
We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z)
- An Optimism-based Approach to Online Evaluation of Generative Models [23.91197677628145]
We propose an online evaluation framework to find the generative model that maximizes a standard assessment score among a group of available models.
Specifically, we study the online assessment of generative models based on the Fréchet Inception Distance (FID) and Inception Score (IS) metrics.
arXiv Detail & Related papers (2024-06-11T16:57:48Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models [28.44922164328789]
Evaluation of text-to-image generative models is an essential step in the development process.
We propose FlashEval, an iterative search algorithm tailored to evaluation data selection.
Our searched 50-item subset could achieve comparable evaluation quality to the randomly sampled 500-item subset for COCO annotations.
arXiv Detail & Related papers (2024-03-25T02:53:32Z)
- OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution [31.00645110294068]
We propose a model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs.
We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model.
Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models.
arXiv Detail & Related papers (2023-12-06T04:53:12Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.