VisIT-Bench: A Benchmark for Vision-Language Instruction Following
Inspired by Real-World Use
- URL: http://arxiv.org/abs/2308.06595v4
- Date: Tue, 26 Dec 2023 15:57:47 GMT
- Title: VisIT-Bench: A Benchmark for Vision-Language Instruction Following
Inspired by Real-World Use
- Authors: Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu,
Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schmidt
- Abstract summary: VisIT-Bench is a benchmark for evaluation of instruction-following vision-language models.
Our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption.
We quantify quality gaps between models and references using both human and automatic evaluations.
- Score: 49.574651930395305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for
evaluation of instruction-following vision-language models for real-world use.
Our starting point is curating 70 'instruction families' that we envision
instruction-tuned vision-language models should be able to address. Extending
beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to
game playing and creative generation. Following curation, our dataset comprises
592 test queries, each with a human-authored instruction-conditioned caption.
These descriptions surface instruction-specific factors, e.g., for an
instruction asking about the accessibility of a storefront for wheelchair
users, the instruction-conditioned caption describes ramps/potential obstacles.
These descriptions enable 1) collecting human-verified reference outputs for
each instance; and 2) automatic evaluation of candidate multimodal generations
using a text-only LLM, aligning with human judgment. We quantify quality gaps
between models and references using both human and automatic evaluations; e.g.,
the top-performing instruction-following model wins against the GPT-4 reference
in just 27% of comparisons. VisIT-Bench supports dynamic participation:
practitioners simply submit their model's responses on the project website.
Data, code, and the leaderboard are available at visit-bench.github.io.
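To make the evaluation protocol concrete, below is a minimal sketch of a text-only LLM judge that compares a candidate response against the human-verified reference using only the instruction and its instruction-conditioned caption, then aggregates a win rate. The Instance fields, prompt wording, and the call_judge hook are illustrative assumptions, not the paper's exact prompt or judging setup.

```python
# Minimal sketch of text-only pairwise judging in the spirit of VisIT-Bench.
# The judge never sees the image: the instruction-conditioned caption stands
# in for it. Prompt text and field names are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Instance:
    instruction: str           # e.g. "Is this storefront wheelchair accessible?"
    conditioned_caption: str   # human-authored, instruction-specific description
    reference: str             # human-verified reference output
    candidate: str             # model response under evaluation


def build_judge_prompt(x: Instance) -> str:
    """Compose a text-only prompt pairing the reference (A) and candidate (B)."""
    return (
        f"Image description: {x.conditioned_caption}\n"
        f"Instruction: {x.instruction}\n\n"
        f"Response A: {x.reference}\n"
        f"Response B: {x.candidate}\n\n"
        "Which response follows the instruction better, given only the "
        "description? Answer with exactly 'A' or 'B'."
    )


def win_rate(instances: Iterable[Instance],
             call_judge: Callable[[str], str]) -> float:
    """Fraction of instances where the candidate ('B') beats the reference ('A')."""
    wins = total = 0
    for x in instances:
        verdict = call_judge(build_judge_prompt(x)).strip().upper()
        wins += verdict.startswith("B")
        total += 1
    return wins / max(total, 1)


# Tiny usage example with a stub judge that always prefers the reference:
demo = [Instance(
    instruction="Is this storefront accessible for wheelchair users?",
    conditioned_caption="A cafe entrance with a single step and no visible ramp.",
    reference="It looks difficult: there is a step at the door and no ramp.",
    candidate="Yes, it appears fully accessible.",
)]
print(win_rate(demo, lambda prompt: "A"))  # -> 0.0
```

With a GPT-4 judge over the 592 test instances, a win_rate of 0.27 would correspond to the figure quoted above: the fraction of comparisons in which the best candidate model is preferred over the GPT-4-generated reference. In practice one would also swap the A/B order across runs to reduce position bias.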
Related papers
- Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation [2.4889060833127665]
In this paper, we focus on evaluating the instruction-following ability of Large Language Models (LLMs) in the context of story-ending generation.
We propose an automatic evaluation pipeline that utilizes a machine reading comprehension (MRC) model to determine whether the generated story ending reflects the instruction.
arXiv Detail & Related papers (2024-06-24T06:53:36Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
arXiv Detail & Related papers (2024-02-11T18:26:18Z)
- Instruction-Following Evaluation for Large Language Models [52.90926820437014]
We introduce Instruction-Following Eval (IFEval) for large language models.
IFEval is a straightforward and easy-to-reproduce evaluation benchmark.
We show evaluation results of two widely available LLMs on the market.
arXiv Detail & Related papers (2023-11-14T05:13:55Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level test sets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning [44.879418596312554]
We present a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos.
Trained on the MIMIC-IT dataset, the Otter model demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning.
We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks [95.06087720086133]
Natural-Instructions v2 is a collection of 1,600+ diverse language tasks and their expert written instructions.
The benchmark covers 70+ distinct task types, such as tagging, in-filling, and rewriting.
This benchmark enables large-scale evaluation of cross-task generalization of the models.
arXiv Detail & Related papers (2022-04-16T03:12:30Z)