What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
- URL: http://arxiv.org/abs/2404.02415v1
- Date: Wed, 3 Apr 2024 02:40:35 GMT
- Title: What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
- Authors: Anthony Meng Huat Tiong, Junqi Zhao, Boyang Li, Junnan Li, Steven C. H. Hoi, Caiming Xiong
- Abstract summary: We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
- Score: 87.65903426052155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection. Finally, we present a new dataset, OLIVE (https://github.com/jq-zh/olive-dataset), which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods.
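Below is a minimal sketch (not the authors' code) of the two analyses the abstract describes, applied to a small made-up score table rather than the paper's actual results: correlating generation-task scores with output length, and factor-analyzing a models-by-tasks score matrix to surface latent skill groupings. The task names, score values, and the use of scikit-learn's FactorAnalysis are all illustrative assumptions; the paper's own factor-analysis procedure may differ.

```python
# A minimal sketch of the analyses described above, run on invented scores.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical data: rows are fine-tuned model runs, columns are VL tasks.
tasks = ["vqa", "captioning", "retrieval", "grounding", "ocr", "dialogue"]
scores = pd.DataFrame(rng.uniform(20, 80, size=(40, len(tasks))), columns=tasks)

# 1) Length bias: correlate generation-task scores with average output length
#    (the lengths here are invented for illustration).
avg_output_len = pd.Series(rng.uniform(5, 60, size=40))
for task in ("captioning", "dialogue"):
    r, p = pearsonr(scores[task], avg_output_len)
    print(f"{task}: corr(score, output length) = {r:.2f} (p = {p:.3f})")

# 2) Latent skills: fit a factor model to the standardized score matrix and
#    inspect which tasks load on which factors.
z = (scores - scores.mean()) / scores.std()
fa = FactorAnalysis(n_components=2, random_state=0).fit(z)
loadings = pd.DataFrame(fa.components_.T, index=tasks,
                        columns=["factor_1", "factor_2"])
print(loadings.round(2))
```

With random scores the recovered factors are meaningless; on real transfer-learning results, the loadings are what would group tasks into interpretable skill factors.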
Related papers
- Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead [1.3995965887921709]
We analyze demographic biases across five models and six datasets.
Portrait datasets like UTKFace and CelebA are the best tools for bias detection.
We introduce a more difficult version of VisoGender to serve as a more rigorous evaluation.
arXiv Detail & Related papers (2024-10-17T02:03:27Z)
- VisEval: A Benchmark for Data Visualization in the Era of Large Language Models [12.077276008688065]
Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language.
In this paper, we propose a new NL2VIS benchmark called VisEval.
This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths.
arXiv Detail & Related papers (2024-07-01T05:35:30Z)
- Vision-Language Instruction Tuning: A Review and Analysis [52.218690619616474]
Vision-Language Instruction Tuning (VLIT) presents more complex characteristics compared to pure text instruction tuning.
We offer a detailed categorization for existing VLIT datasets and identify the characteristics that high-quality VLIT data should possess.
By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs.
arXiv Detail & Related papers (2023-11-14T14:02:32Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
- TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning [25.520406167426135]
We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks.
Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information.
We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
arXiv Detail & Related papers (2021-11-21T07:22:44Z)
- e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments, and that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task (see the sketch after this entry).
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
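As a rough illustration of the variance-reporting point in the entry above (not the paper's protocol), the sketch below takes hypothetical per-example metric scores against each of several human references and reports the spread across references alongside the mean; the data, the max-over-references aggregation, and the bootstrap scheme are all assumptions for illustration.

```python
# A minimal sketch, assuming invented per-example metric scores against
# each of 5 human references for the same 1,000 test examples.
import numpy as np

rng = np.random.default_rng(0)
per_reference_scores = rng.uniform(0.2, 0.9, size=(1000, 5))  # invented numbers

# Corpus-level score when evaluating against each single reference in turn.
single_ref = per_reference_scores.mean(axis=0)
print("per-reference corpus scores:", np.round(single_ref, 3))
print(f"mean = {single_ref.mean():.3f}, std = {single_ref.std(ddof=1):.3f}")

# Bootstrap over reference sets (max-over-references aggregation is just one
# common choice) to get an uncertainty estimate worth reporting with the mean.
boot = [per_reference_scores[:, rng.choice(5, size=5, replace=True)].max(axis=1).mean()
        for _ in range(1000)]
print(f"multi-reference score: {np.mean(boot):.3f} +/- {np.std(boot):.3f}")
```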