ReForm-Eval: Evaluating Large Vision Language Models via Unified
Re-Formulation of Task-Oriented Benchmarks
- URL: http://arxiv.org/abs/2310.02569v2
- Date: Tue, 17 Oct 2023 08:11:15 GMT
- Title: ReForm-Eval: Evaluating Large Vision Language Models via Unified
Re-Formulation of Task-Oriented Benchmarks
- Authors: Zejun Li, Ye Wang, Mengfei Du, Qingwen Liu, Binhao Wu, Jiwen Zhang,
Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen, Xuanjing Huang, Zhongyu
Wei
- Abstract summary: Large vision-language models (LVLMs) exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning.
Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.
- Score: 76.25209974199274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed remarkable progress in the development of large
vision-language models (LVLMs). Benefiting from the strong language backbones
and efficient cross-modal alignment strategies, LVLMs exhibit surprising
capabilities to perceive visual signals and perform visually grounded
reasoning. However, the capabilities of LVLMs have not been comprehensively and
quantitatively evaluated. Most existing multi-modal benchmarks require
task-oriented input-output formats, posing great challenges to automatically
assess the free-form text output of LVLMs. To effectively leverage the
annotations available in existing benchmarks and reduce the manual effort
required for constructing new benchmarks, we propose to re-formulate existing
benchmarks into unified LVLM-compatible formats. Through systematic data
collection and reformulation, we present the ReForm-Eval benchmark, offering
substantial data for evaluating various capabilities of LVLMs. Based on
ReForm-Eval, we conduct extensive experiments, thoroughly analyze the strengths
and weaknesses of existing LVLMs, and identify the underlying factors. Our
benchmark and evaluation framework will be open-sourced as a cornerstone for
advancing the development of LVLMs.
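To make the reformulation idea concrete, below is a minimal sketch (in Python) of how a task-oriented VQA annotation could be converted into a unified multiple-choice item and scored automatically. The record fields and the `query_lvlm` helper are hypothetical illustrations under assumed data formats, not the authors' released code.
```python
# Minimal sketch: re-formulating a task-oriented QA record into a unified
# multiple-choice prompt and scoring an LVLM's free-form output.
# The record fields and query_lvlm(image, prompt) -> str are assumptions
# for illustration; this is not the ReForm-Eval implementation.
import re
import random

def reformulate(record, distractors, rng=random.Random(0)):
    """Turn a task-oriented QA record into an LVLM-compatible multiple-choice item."""
    options = distractors + [record["answer"]]
    rng.shuffle(options)
    letters = "ABCD"[: len(options)]  # assumes at most four options
    option_text = "\n".join(f"({l}) {o}" for l, o in zip(letters, options))
    prompt = (
        f"Question: {record['question']}\n"
        f"Options:\n{option_text}\n"
        "Answer with the option letter."
    )
    gold = letters[options.index(record["answer"])]
    return {"image": record["image"], "prompt": prompt, "gold": gold}

def score(free_form_output, gold):
    """Map the model's free-form text back to an option letter and compare.
    A real evaluator would need more robust answer matching than this regex."""
    match = re.search(r"\(?([A-D])\)?\b", free_form_output.strip().upper())
    return match is not None and match.group(1) == gold

# Hypothetical usage:
# item = reformulate({"image": "img.jpg", "question": "What animal is shown?",
#                     "answer": "a dog"}, ["a cat", "a horse", "a bird"])
# correct = score(query_lvlm(item["image"], item["prompt"]), item["gold"])
```
Casting every task as option selection is what makes free-form LVLM output automatically checkable: the evaluator only needs to recover an option choice from the generated text instead of judging open-ended answers against task-specific output formats.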
Related papers
- Transferring Textual Preferences to Vision-Language Understanding through Model Merging [65.41765072566287]
This paper explores a training-free alternative by merging text-based reward models (RMs) with large vision-language models (LVLMs).
Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs.
arXiv Detail & Related papers (2025-02-19T07:20:07Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.
Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware [4.480157114854711]
We present FVEval, the first comprehensive benchmark for characterizing the performance of large language models (LLMs) in tasks pertaining to formal verification (FV).
The benchmark consists of three sub-tasks that measure LLM capabilities at different levels.
We present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with FV.
arXiv Detail & Related papers (2024-10-15T21:48:57Z) - Large Vision-Language Models as Emotion Recognizers in Context Awareness [14.85890824622433]
Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues.
Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images.
This paper systematically explores the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task.
arXiv Detail & Related papers (2024-07-16T01:28:06Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z) - Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning.
We propose a novel method, termed "reflection-tuning".
This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation hub (LVLM-eHub).
Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, an instruction-tuned LVLM trained with massive in-domain data, such as InstructBLIP, heavily overfits many existing tasks and generalizes poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)