OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions
- URL: http://arxiv.org/abs/2412.06693v1
- Date: Mon, 09 Dec 2024 17:39:43 GMT
- Title: OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions
- Authors: Yi-Kai Zhang, Xu-Xiang Zhong, Shiyin Lu, Qing-Guo Chen, De-Chuan Zhan, Han-Jia Ye
- Abstract summary: We present OmniEvalKit, a novel benchmarking toolbox designed to evaluate Large Language Models (LLMs) and their omni-extensions across multilingual, multidomain, and multimodal capabilities.
Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit provides a modular, lightweight, and automated evaluation system.
It is structured with a modular architecture comprising a Static Builder and Dynamic Data Flow, promoting the seamless integration of new models and datasets.
- Abstract: The rapid advancements in Large Language Models (LLMs) have significantly expanded their applications, ranging from multilingual support to domain-specific tasks and multimodal integration. In this paper, we present OmniEvalKit, a novel benchmarking toolbox designed to evaluate LLMs and their omni-extensions across multilingual, multidomain, and multimodal capabilities. Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit provides a modular, lightweight, and automated evaluation system. It is structured with a modular architecture comprising a Static Builder and Dynamic Data Flow, promoting the seamless integration of new models and datasets. OmniEvalKit supports over 100 LLMs and 50 evaluation datasets, covering comprehensive evaluations across thousands of model-dataset combinations. OmniEvalKit is dedicated to creating an ultra-lightweight and fast-deployable evaluation framework, making downstream applications more convenient and versatile for the AI community.
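As a rough illustration of the Static Builder / Dynamic Data Flow architecture described in the abstract, the Python sketch below shows one way such a modular design can work: models and datasets are registered statically, and any model-dataset pair is then routed through a shared evaluation loop. All names here (register_model, register_dataset, evaluate, and the toy builders) are hypothetical assumptions for illustration and do not reflect OmniEvalKit's actual API.

```python
# Hypothetical sketch of a "Static Builder + Dynamic Data Flow" evaluation
# toolkit. None of these names come from OmniEvalKit's real codebase.
from typing import Callable, Dict, List

# Static Builder side: registries populated once, before any evaluation runs.
MODEL_BUILDERS: Dict[str, Callable[[], Callable[[str], str]]] = {}
DATASET_BUILDERS: Dict[str, Callable[[], List[dict]]] = {}

def register_model(name: str):
    """Decorator that statically registers a model builder under a name."""
    def decorator(builder):
        MODEL_BUILDERS[name] = builder
        return builder
    return decorator

def register_dataset(name: str):
    """Decorator that statically registers a dataset builder under a name."""
    def decorator(builder):
        DATASET_BUILDERS[name] = builder
        return builder
    return decorator

@register_model("echo-model")
def build_echo_model() -> Callable[[str], str]:
    # Stand-in for loading a real LLM: answers with the prompt's last token.
    return lambda prompt: prompt.split()[-1]

@register_dataset("toy-qa")
def build_toy_qa() -> List[dict]:
    return [{"question": "2 + 2 = ? Answer: 4", "answer": "4"}]

def evaluate(model_name: str, dataset_name: str) -> float:
    """Dynamic Data Flow side: stream any registered dataset through any
    registered model and score the predictions."""
    model = MODEL_BUILDERS[model_name]()
    data = DATASET_BUILDERS[dataset_name]()
    correct = sum(model(ex["question"]) == ex["answer"] for ex in data)
    return correct / len(data)

if __name__ == "__main__":
    # Any registered model can be paired with any registered dataset.
    print(evaluate("echo-model", "toy-qa"))  # -> 1.0
```

The value of this pattern is that adding a new model or dataset is a single registration; every existing model-dataset combination then works without touching the evaluation loop, which is how a toolbox can scale to thousands of combinations.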
Related papers
- LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision.
LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks.
We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
- From LIMA to DeepLIMA: following a new path of interoperability
We describe the architecture of the LIMA framework and its recent evolution with the addition of new text analysis modules based on deep neural networks.
Models were trained for more than 60 languages on the Universal Dependencies 2.5 corpora, WikiNer corpora, and CoNLL-03 dataset.
This integration of ubiquitous Deep Learning Natural Language Processing models and the use of standard annotated collections can be viewed as a new path of interoperability.
arXiv Detail & Related papers (2024-09-10T14:26:12Z)
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
We present VLMEvalKit, a PyTorch-based open-source toolkit for evaluating large multi-modality models.
VLMEvalKit implements over 70 different large multi-modality models, including both proprietary APIs and open-source models.
We host the OpenVLM Leaderboard to track the progress of multi-modality learning research.
arXiv Detail & Related papers (2024-07-16T13:06:15Z)
- UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs
This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency.
The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow.
arXiv Detail & Related papers (2024-04-11T09:17:12Z)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
We develop an extensive Multimodal Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
- LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
We introduce the LLMeBench framework, which can be seamlessly customized to evaluate Large Language Models (LLMs) for any NLP task, regardless of language.
A specific dataset and task can be evaluated for a given LLM in less than 20 lines of code while allowing full flexibility to extend the framework for custom datasets, models, or tasks (see the illustrative sketch after this list).
The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points.
arXiv Detail & Related papers (2023-08-09T13:22:37Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
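Several of the frameworks above (UltraEval's free combination of models, tasks, prompts, benchmarks, and metrics; LLMeBench's evaluations in under 20 lines of code) describe the same composable workflow: load a model, load a dataset, apply a metric. The sketch below illustrates that shape only; the function names and the toy model are invented for this example and match neither framework's real API.

```python
# Invented mini-workflow for illustration; not LLMeBench's or UltraEval's API.
from statistics import mean

def load_model(name: str):
    # Stand-in for a real LLM client: uppercases its input.
    return lambda prompt: prompt.upper()

def load_dataset(name: str):
    return [{"input": "hi", "target": "HI"}, {"input": "ok", "target": "OK"}]

def exact_match(prediction: str, target: str) -> float:
    return float(prediction == target)

model = load_model("some-llm")
data = load_dataset("toy-uppercase")
scores = [exact_match(model(ex["input"]), ex["target"]) for ex in data]
print(f"accuracy: {mean(scores):.2f}")  # -> accuracy: 1.00
```

Because the model, dataset, and metric are independent pieces, swapping any one of them is a one-line change, which is the composability these frameworks emphasize.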