On the Evaluation and Refinement of Vision-Language Instruction Tuning
Datasets
- URL: http://arxiv.org/abs/2310.06594v2
- Date: Sat, 30 Dec 2023 02:19:22 GMT
- Title: On the Evaluation and Refinement of Vision-Language Instruction Tuning
Datasets
- Authors: Ning Liao, Shaofeng Zhang, Renqiu Xia, Min Cao, Yu Qiao, Junchi Yan
- Abstract summary: We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all VLIT datasets.
- Score: 71.54954966652286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is an emerging line of research on multimodal instruction
tuning, and a line of benchmarks has recently been proposed for evaluating
these models. Instead of evaluating the models directly, in this paper we
evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves. We
also seek a way of building a dataset for developing an all-powerful VLIT
model, which we believe could also serve as a grounded protocol for
benchmarking VLIT models. Since effective evaluation of VLIT datasets remains
an open question, we propose a tune-cross-evaluation paradigm: tuning on one
dataset and evaluating on the others in turn. For each single tune-evaluation
experiment set, we define the Meta Quality (MQ) as the mean score obtained by a
set of caption metrics, including BLEU, METEOR, and ROUGE-L, to quantify the
quality of a dataset or a sample. On this basis, to evaluate the
comprehensiveness of a dataset, we develop the Dataset Quality (DQ), which
covers all tune-evaluation sets. To lay the foundation for building a
comprehensive dataset and developing an all-powerful model for practical
applications, we define the Sample Quality (SQ) to quantify the all-sided
quality of each sample. Extensive experiments validate the rationality of the
proposed evaluation paradigm. Based on this holistic evaluation, we build a new
dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing), by collecting
samples with higher SQ from each dataset. Remarkably, even with only half of
the complete data, the model trained on REVO-LION achieves performance
comparable to simply combining all VLIT datasets. Furthermore, REVO-LION not
only facilitates the development of a powerful model but also includes an
evaluation set designed to serve as a convenient benchmark for future research
in the field.
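
The MQ defined above is simply the mean of standard caption metrics computed for one tune-evaluation pair; DQ and SQ then aggregate MQ across pairs. As an illustration only (not the authors' code), here is a minimal Python sketch assuming whitespace tokenization and deliberately simplified single-n-gram variants of the metrics (real BLEU combines n-grams up to length 4, and real METEOR also matches stems and synonyms):

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified BLEU: unigram precision with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    bp = math.exp(1 - len(ref) / len(cand)) if len(cand) < len(ref) else 1.0
    return bp * precision

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure based on the longest common subsequence."""
    a, b = candidate.split(), reference.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

def meteor_like(candidate: str, reference: str) -> float:
    """METEOR-style recall-weighted harmonic mean over exact unigram matches."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    m = sum((Counter(cand) & Counter(ref)).values())
    if m == 0:
        return 0.0
    p, r = m / len(cand), m / len(ref)
    return 10 * p * r / (r + 9 * p)  # METEOR weights recall 9:1 over precision

def meta_quality(candidate: str, reference: str) -> float:
    """MQ: the mean of the caption metrics, per the paper's definition."""
    return (bleu1(candidate, reference)
            + meteor_like(candidate, reference)
            + rouge_l(candidate, reference)) / 3.0
```

Under this sketch, DQ for a dataset would be an aggregate of MQ over all tune-evaluation sets in which it participates, and SQ the analogous per-sample aggregate.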
Related papers
- Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models.
Our approach ensures statistically aligned model rankings compared to full datasets.
We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z)
- An Optimism-based Approach to Online Evaluation of Generative Models [23.91197677628145]
We propose an online evaluation framework to find the generative model that maximizes a standard assessment score among a group of available models.
Specifically, we study the online assessment of generative models based on the Fréchet Inception Distance (FID) and Inception Score (IS) metrics.
arXiv Detail & Related papers (2024-06-11T16:57:48Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models [28.44922164328789]
Evaluation of text-to-image generative models is an essential step in the development process.
We propose FlashEval, an iterative search algorithm tailored to evaluation data selection.
Our searched 50-item subset could achieve comparable evaluation quality to the randomly sampled 500-item subset for COCO annotations.
arXiv Detail & Related papers (2024-03-25T02:53:32Z)
- OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution [31.00645110294068]
We propose a model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs.
We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model.
Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models.
arXiv Detail & Related papers (2023-12-06T04:53:12Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.