Automatic benchmarking of large multimodal models via iterative experiment programming
- URL: http://arxiv.org/abs/2406.12321v1
- Date: Tue, 18 Jun 2024 06:43:46 GMT
- Title: Automatic benchmarking of large multimodal models via iterative experiment programming
- Authors: Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, Elisa Ricci,
- Abstract summary: We present APEx, the first framework for automatic benchmarking of LMMs.
Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand.
The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions.
- Score: 71.78089106671581
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.
Related papers
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - An Empirical Study of Unit Test Generation with Large Language Models [16.447000441006814]
Unit testing is an essential activity in software development for verifying the correctness of software components.
The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation.
arXiv Detail & Related papers (2024-06-26T08:57:03Z) - OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z) - Synthetic Test Collections for Retrieval Evaluation [31.36035082257619]
Test collections play a vital role in evaluation of information retrieval (IR) systems.
We investigate whether it is possible to use Large Language Models (LLMs) to construct synthetic test collections.
Our experiments indicate that using LLMs it is possible to construct synthetic test collections that can reliably be used for retrieval evaluation.
arXiv Detail & Related papers (2024-05-13T14:11:09Z) - Automating REST API Postman Test Cases Using LLM [0.0]
This research paper is dedicated to the exploration and implementation of an automated approach to generate test cases using Large Language Models.
The methodology integrates the use of Open AI to enhance the efficiency and effectiveness of test case generation.
The model that is developed during the research is trained using manually collected postman test cases or instances for various Rest APIs.
arXiv Detail & Related papers (2024-04-16T15:53:41Z) - System for systematic literature review using multiple AI agents:
Concept and an empirical evaluation [5.194208843843004]
We introduce a novel multi-AI agent model designed to fully automate the process of conducting Systematic Literature Reviews.
The model operates through a user-friendly interface where researchers input their topic.
It generates a search string used to retrieve relevant academic papers.
The model then autonomously summarizes the abstracts of these papers.
arXiv Detail & Related papers (2024-03-13T10:27:52Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM
Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs)
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS)
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Automatic Evaluation of Attribution by Large Language Models [24.443271739599194]
We investigate the automatic evaluation of attribution given by large language models (LLMs)
We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation.
We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing.
arXiv Detail & Related papers (2023-05-10T16:58:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.