Related papers: VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

URL: http://arxiv.org/abs/2407.11691v1
Date: Tue, 16 Jul 2024 13:06:15 GMT
Title: VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Authors: Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
Abstract summary: We present an open-source toolkit for evaluating large multi-modality models based on PyTorch. VLMEvalKit implements over 70 different large multi-modality models, including both proprietary APIs and open-source models. We host OpenVLM Leaderboard to track the progress of multi-modality learning research.
Score: 78.76009461738299
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
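The "single interface" idea in the abstract can be illustrated with a minimal sketch. This is not VLMEvalKit's actual API: the names `BaseVLM`, `AlwaysYes`, `postprocess`, `evaluate`, and `SAMPLES` are hypothetical, the benchmark is a two-item toy, and distributed inference is left out. It only shows how a harness could own inference, post-processing, and metric calculation while a new model supplies one method.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseVLM(ABC):
    """Hypothetical single interface that every model wrapper implements."""

    @abstractmethod
    def generate(self, image_path: str, question: str) -> str:
        """Return the model's raw text answer for one image-question pair."""


class AlwaysYes(BaseVLM):
    """Toy stand-in for a real model; it ignores the image entirely."""

    def generate(self, image_path: str, question: str) -> str:
        return "The answer is: Yes."


def postprocess(raw: str) -> str:
    """Reduce a free-form answer to a 'yes', 'no', or 'unknown' label."""
    text = raw.lower()
    for mark in ".,:":
        text = text.replace(mark, " ")
    words = text.split()
    if "yes" in words:
        return "yes"
    if "no" in words:
        return "no"
    return "unknown"


def evaluate(model: BaseVLM, samples: List[Dict[str, str]]) -> float:
    """Run inference, post-process each prediction, and compute accuracy."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        raw = model.generate(sample["image"], sample["question"])
        if postprocess(raw) == sample["answer"]:
            correct += 1
    return correct / len(samples)


# Hypothetical toy benchmark; the image paths are never opened.
SAMPLES = [
    {"image": "cat.jpg", "question": "Is there a cat?", "answer": "yes"},
    {"image": "dog.jpg", "question": "Is there a cat?", "answer": "no"},
]

if __name__ == "__main__":
    print(evaluate(AlwaysYes(), SAMPLES))
```

Under this design, supporting another model means writing one more `BaseVLM` subclass; the evaluation loop itself does not change.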

Related papers

Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality [74.59049806800176]
This demo paper highlights the Tevatron toolkit's key features, bridging academia and industry. We showcase a unified dense retriever achieving strong multilingual and multimodal effectiveness. We also release OmniEmbed, to the best of our knowledge, the first embedding model that unifies text, image document, video, and audio retrieval.
arXiv Detail & Related papers (2025-05-05T08:52:49Z)
OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions [58.46747176834132]
We present OmniEvalKit, a novel benchmarking toolbox designed to evaluate Large Language Models (LLMs). Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit provides a modular, lightweight, and automated evaluation system. It is structured with a modular architecture comprising a Static Builder and Dynamic Data Flow, promoting the seamless integration of new models and datasets.
arXiv Detail & Related papers (2024-12-09T17:39:43Z)
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models. Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions. We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models [1.3535643703577176]
MultiMedEval is an open-source toolkit for fair and reproducible evaluation of large, medical vision-language models (VLMs). It comprehensively assesses the models' performance on a broad array of six multi-modal tasks, conducted over 23 datasets, and spanning over 11 medical domains. We open-source a Python toolkit with a simple interface and setup process, enabling the evaluation of any VLM in just a few lines of code.
arXiv Detail & Related papers (2024-02-14T15:49:08Z)
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models [51.35570730554632]
ESPnet-SPK is a toolkit for training speaker embedding extractors. We provide several models, ranging from x-vector to recent SKA-TDNN. We also aspire to bridge developed models with other domains.
arXiv Detail & Related papers (2024-01-30T18:18:27Z)
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals. The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning [110.54752872873472]
MultiZoo is a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms. MultiBench is a benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
arXiv Detail & Related papers (2023-06-28T17:59:10Z)
PiML Toolbox for Interpretable Machine Learning Model Development and Diagnostics [10.635578367440162]
PiML is an integrated and open-access Python toolbox for interpretable machine learning model development and model diagnostics. It is designed with machine learning workflows in both low-code and high-code modes, including data pipeline, model training and tuning, model interpretation and explanation.
arXiv Detail & Related papers (2023-05-07T08:19:07Z)
Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit [6.187270874122921]
We propose a toolkit for systematic multimodal VAE training and comparison. We present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities.
arXiv Detail & Related papers (2022-09-07T10:26:28Z)
DIME: An Online Tool for the Visual Comparison of Cross-Modal Retrieval Models [5.725477071353354]
Cross-modal retrieval relies on accurate models to retrieve relevant results for queries across modalities such as image, text, and video. We present DIME, a modality-agnostic tool that handles multimodal datasets, trained models, and data preprocessors.
arXiv Detail & Related papers (2020-10-19T16:35:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.