What Is Missing in Multilingual Visual Reasoning and How to Fix It
- URL: http://arxiv.org/abs/2403.01404v2
- Date: Sun, 09 Feb 2025 15:41:33 GMT
- Title: What Is Missing in Multilingual Visual Reasoning and How to Fix It
- Authors: Yueqi Song, Simran Khanuja, Graham Neubig
- Abstract summary: We evaluate NLP models' multilingual, multimodal capabilities by testing on a visual reasoning task.
Proprietary systems like GPT-4V currently obtain the best performance on this task, while open models lag behind.
Our interventions achieve the best open-model performance on this task in a zero-shot setting, boosting LLaVA-v1.5-13B by 13.4%, LLaVA-v1.6-34B by 20.3%, and Qwen-VL by 16.7%.
- Score: 57.37123046817781
- License:
- Abstract: NLP models today strive for supporting multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V obtain the best performance on this task now, but open models lag in comparison. Surprisingly, GPT-4V exhibits similar performance between English and other languages, indicating the potential for equitable system development across languages. Our analysis on model failures reveals three key aspects that make this task challenging: multilinguality, complex reasoning, and multimodality. To address these challenges, we propose three targeted interventions including a translate-test approach to tackle multilinguality, a visual programming approach to break down complex reasoning, and a method that leverages image captioning to address multimodality. Our interventions achieve the best open performance on this task in a zero-shot setting, boosting open models LLaVA-v1.5-13B by 13.4%, LLaVA-v1.6-34B by 20.3%, and Qwen-VL by 16.7%, while also minorly improving GPT-4V's performance.
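The three interventions compose naturally into one inference pipeline. Below is a minimal, hedged sketch of that pipeline for a two-image true/false example (an NLVR2-style format, which is an assumption here); every helper (`translate_to_english`, `caption_image`, `query_llm`) is a hypothetical placeholder rather than the authors' released code, so you would wire in your own MT system, captioner, and VLM backend.

```python
"""Sketch of the three interventions on a multilingual visual reasoning example.

Assumptions: a two-image true/false task format and placeholder helpers;
this is not the paper's released implementation.
"""


def translate_to_english(statement: str, source_lang: str) -> str:
    # Translate-test intervention: move the non-English statement into English
    # before querying a (largely English-trained) open model.
    raise NotImplementedError("plug in your MT system here")


def caption_image(image_path: str) -> str:
    # Captioning intervention: turn each image into an English description so
    # the reasoning step can be done over text.
    raise NotImplementedError("plug in your image captioner here")


def query_llm(prompt: str) -> str:
    # Any instruction-following backend (LLaVA, Qwen-VL, GPT-4V, ...).
    raise NotImplementedError("plug in your model backend here")


def verify_statement(statement: str, source_lang: str,
                     left_image: str, right_image: str) -> str:
    """Return 'True' or 'False' for a statement about an image pair."""
    # 1) Multilinguality: translate-test.
    english_statement = translate_to_english(statement, source_lang)

    # 2) Multimodality: caption each image so the reasoning is textual.
    left_caption = caption_image(left_image)
    right_caption = caption_image(right_image)

    # 3) Complex reasoning: ask for explicit sub-steps before the final answer
    #    (a lightweight stand-in for the paper's visual-programming idea).
    prompt = (
        "Left image: " + left_caption + "\n"
        "Right image: " + right_caption + "\n"
        "Statement: " + english_statement + "\n"
        "First list the facts needed to verify the statement, "
        "then answer with exactly 'True' or 'False'."
    )
    answer = query_llm(prompt)
    return "True" if "true" in answer.lower() else "False"
```

The structure mirrors the abstract's diagnosis: translation addresses multilinguality, captioning reduces the multimodal burden, and the step-by-step prompt stands in for decomposing complex reasoning.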
Related papers
- Test-Time Code-Switching for Cross-lingual Aspect Sentiment Triplet Extraction [12.269762062755492]
We introduce a novel Test-Time Code-SWitching (TT-CSW) framework to bridge the gap between the bilingual training phase and the monolingual test-time prediction.
During training, a generative model is developed based on bilingual code-switched training data and can produce bilingual ASTE triplets for bilingual inputs.
In the testing stage, we employ an alignment-based code-switching technique for test-time augmentation.
arXiv Detail & Related papers (2025-01-24T00:00:51Z)
- Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model [66.17354128553244]
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data.
We investigate how different training mixes tip the scale for different groups of languages.
We train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
arXiv Detail & Related papers (2025-01-09T10:26:14Z)
- SandboxAQ's submission to MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval [1.2629889435114405]
This paper explores the problems of Question Answering (QA) and Named Entity Recognition (NER) in five diverse languages.
We tested five Large Language Models with various prompting methods, including zero-shot, chain-of-thought reasoning, and translation techniques.
Our results show that while some models consistently outperform others, their effectiveness varies significantly across tasks and languages.
arXiv Detail & Related papers (2024-10-28T20:15:45Z)
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions.
Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks.
Languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
- M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks [10.677274746850554]
M5 is the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual context.
We highlight substantial task-agnostic performance disparities between high- and low-resource languages.
We show that larger models do not necessarily outperform smaller ones in a multilingual setting.
arXiv Detail & Related papers (2024-07-04T09:55:04Z)
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
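For readers unfamiliar with the adapter technique named in the xGQA summary, here is a minimal sketch of a bottleneck adapter layer of the kind typically inserted into a frozen transformer block. It is a generic illustration under assumed names and sizes, not the xGQA paper's exact architecture.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted after a frozen transformer sub-layer.

    Only the adapter is trained for the new language; the pretrained
    multimodal backbone stays frozen.
    """

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's representation
        # when the adapter's contribution is small.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


if __name__ == "__main__":
    hidden = torch.randn(2, 16, 768)          # (batch, tokens, hidden)
    adapter = BottleneckAdapter(hidden_size=768)
    out = adapter(hidden)
    print(out.shape)                           # torch.Size([2, 16, 768])
```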