UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
- URL: http://arxiv.org/abs/2408.04810v1
- Date: Fri, 9 Aug 2024 01:41:05 GMT
- Title: UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
- Authors: Haider Al-Tahan, Quentin Garrido, Randall Balestriero, Diane Bouchacourt, Caner Hazirbas, Mark Ibrahim,
- Abstract summary: We introduce UniBench, a unified implementation of 50+ vision-language model (VLM) benchmarks.
We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models.
We also release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled set of benchmarks that runs in 5 minutes on a single GPU.
- Score: 22.885385107905222
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today's best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, we find that more precise interventions, such as data quality or tailored-learning objectives offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.
Related papers
- Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing [19.577278316436807]
Large Language Models (LLMs) are limited by the context window size.
We propose a novel method that leverages the LLMs's own attention information to enable accurate retrieval.
InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack(NIH) test over 1M tokens using a 0.5B parameter model.
arXiv Detail & Related papers (2025-02-18T15:45:36Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM [13.324171480106715]
EvoEval is a program synthesis benchmark suite created by evolving existing benchmarks into different targeted domains.
Our study shows that compared to the high performance obtained on standard benchmarks like HumanEval, there is a significant drop in performance.
We showcase various insights, including the brittleness of instruction-following models when encountering rewording or subtle changes.
arXiv Detail & Related papers (2024-03-28T03:10:39Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM
Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs)
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision
Research [96.53307645791179]
We introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks.
Despite being limited to classification, the resulting stream has a rich diversity of tasks from OCR, to texture analysis, scene recognition, and so forth.
Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks.
arXiv Detail & Related papers (2022-11-15T18:57:46Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and
Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.