Related papers: LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

URL: http://arxiv.org/abs/2507.07274v1
Date: Wed, 09 Jul 2025 20:45:04 GMT
Title: LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation
Authors: Ananya Raval, Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza,
Abstract summary: We introduce a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task.<n>Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes.<n>We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness.
Score: 2.9248916859490173
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.

Related papers

Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages:A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish [12.286855282078305]
GPT-4o, GPT-4, Claude3.5Sonnet, LLaMA3.1, MistralLarge2, LLaMA-2Chat13B, and Mistral7BInstruct are evaluated.<n>Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue.
arXiv Detail & Related papers (2025-11-05T22:09:53Z)
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
M-Prometheus: A Suite of Open Multilingual LLM Judges [64.22940792713713]
We introduce M-Prometheus, a suite of open-weight LLM judges that can provide both direct assessment and pairwise comparison feedback on multilingual outputs.<n>M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs.
arXiv Detail & Related papers (2025-04-07T11:37:26Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.<n>P-MMEval delivers consistent language coverage across various datasets and provides parallel samples.<n>We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation [28.531581489405745]
mHumanEval is an extended benchmark supporting prompts in over 200 natural languages.<n>We provide expert human translations for 15 diverse natural languages (NLs)<n>We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs.
arXiv Detail & Related papers (2024-10-19T08:44:26Z)
M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks [10.677274746850554]
M5 is the first comprehensive benchmark designed to evaluate LMMs on diverse vision-modal tasks within a multilingual context. We highlight substantial task-agnostic performance disparities between high- and low-resource languages. We show that larger models do not necessarily outperform smaller ones in a multilingual setting.
arXiv Detail & Related papers (2024-07-04T09:55:04Z)
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models [18.083861654053585]
This paper introduces IrokoBench -- a human-translated benchmark dataset for 17 typologically-diverse low-resource African languages.<n>We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings(where test sets are translated into English) across 10 open and six proprietary language models.<n>We observe a significant performance gap between open and proprietary models, with the highest performing open model, Gemma 2 27B only at 63% of the best-performing proprietary model GPT-4o performance.
arXiv Detail & Related papers (2024-06-05T15:23:08Z)
MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis [1.5761916307614148]
We propose the first benchmark of sentence embeddings for French. We compare 51 carefully selected embedding models on a large scale. We find that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well.
arXiv Detail & Related papers (2024-05-30T20:34:37Z)
Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs. We construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages. By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models. MMBench is meticulously curated with well-designed quality control schemes. MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages. We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.