Related papers: M$\mathbf5$ -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks

M$\mathbf5$ -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks

URL: http://arxiv.org/abs/2407.03791v1
Date: Thu, 4 Jul 2024 09:55:04 GMT
Title: M$\mathbf5$ -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
Authors: Florian Schneider, Sunayana Sitaram,
Abstract summary: M5 is the first comprehensive benchmark designed to evaluate LMMs on diverse vision-modal tasks within a multilingual context. We highlight substantial task-agnostic performance disparities between high- and low-resource languages. We show that larger models do not necessarily outperform smaller ones in a multilingual setting.
Score: 10.677274746850554
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Since the release of ChatGPT, the field of Natural Language Processing has experienced rapid advancements, particularly in Large Language Models (LLMs) and their multimodal counterparts, Large Multimodal Models (LMMs). Despite their impressive capabilities, LLMs often exhibit significant performance disparities across different languages and cultural contexts, as demonstrated by various text-only benchmarks. However, current research lacks such benchmarks for multimodal visio-linguistic settings. This work fills this gap by introducing M5, the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual and multicultural context. M5 includes eight datasets covering five tasks and $41$ languages, with a focus on underrepresented languages and culturally diverse images. Furthermore, we introduce two novel datasets, M5-VGR and M5-VLOD, including a new Visio-Linguistic Outlier Detection task, in which all evaluated open-source models fail to significantly surpass the random baseline. Through extensive evaluation and analyses, we highlight substantial task-agnostic performance disparities between high- and low-resource languages. Moreover, we show that larger models do not necessarily outperform smaller ones in a multilingual setting.

Related papers

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks [25.75895667904485]
We introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks.<n>MCF spans three core modalities--speech, vision, and text--and four diverse languages (English, German, Italian, and Chinese)<n>It enables a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information.
arXiv Detail & Related papers (2025-07-25T19:00:51Z)
Multi-Agent Multimodal Models for Multicultural Text to Image Generation [4.934817254755008]
We evaluate the performance of Large Language Models (LLMs) in a multi-agent interaction setting for the novel task of multicultural image generation. We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics.
arXiv Detail & Related papers (2025-02-21T22:19:56Z)
Exploring Vision Language Models for Multimodal and Multilingual Stance Detection [9.079302402271491]
Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios relatively underexplored. This paper evaluates state-of-the-art Vision-Language Models (VLMs) on multimodal and multilingual stance detection tasks.
arXiv Detail & Related papers (2025-01-29T13:39:53Z)
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks. We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.459861376459656]
EMMA-500 is a large-scale multilingual language model continue-trained on texts across 546 languages. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity.
arXiv Detail & Related papers (2024-09-26T14:40:45Z)
E5-V: Universal Embeddings with Multimodal Large Language Models [51.5978154046302]
We introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs. E5-V achieves strong performance in multimodal embeddings even without fine-tuning.
arXiv Detail & Related papers (2024-07-17T14:04:12Z)
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce a Multi-Image MIRB Benchmark to evaluate visual language models' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. We demonstrate that while open-source VLMs were shown to approach the GPT-4V in single-image tasks, a significant gap remains in multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z)
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models [27.18427414844769]
We introduce M4U, a novel benchmark for assessing the capability of multi-discipline multilingual multimodal understanding and reasoning. M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German. Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools.
arXiv Detail & Related papers (2024-05-24T15:25:28Z)
MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.