What Is Missing in Multilingual Visual Reasoning and How to Fix It
- URL: http://arxiv.org/abs/2403.01404v1
- Date: Sun, 3 Mar 2024 05:45:27 GMT
- Title: What Is Missing in Multilingual Visual Reasoning and How to Fix It
- Authors: Yueqi Song, Simran Khanuja, Graham Neubig
- Abstract summary: We evaluate NLP models' multilingual, multimodal capabilities by testing on a visual reasoning task.
Proprietary systems like GPT-4V currently obtain the best performance on this task, but open models lag behind.
Our interventions achieve the best open performance on this task in a zero-shot setting, boosting the open model LLaVA by 13.4%.
- Score: 64.47951359580556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: NLP models today strive for supporting multiple languages and modalities,
improving accessibility for diverse users. In this paper, we evaluate their
multilingual, multimodal capabilities by testing on a visual reasoning task. We
observe that proprietary systems like GPT-4V obtain the best performance on
this task now, but open models lag in comparison. Surprisingly, GPT-4V exhibits
similar performance between English and other languages, indicating the
potential for equitable system development across languages. Our analysis on
model failures reveals three key aspects that make this task challenging:
multilinguality, complex reasoning, and multimodality. To address these
challenges, we propose three targeted interventions including a translate-test
approach to tackle multilinguality, a visual programming approach to break down
complex reasoning, and a novel method that leverages image captioning to
address multimodality. Our interventions achieve the best open performance on
this task in a zero-shot setting, boosting the open model LLaVA by 13.4%, while
also modestly improving GPT-4V's performance.
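To make the interventions concrete, here is a minimal, hypothetical Python sketch of the translate-test and caption-then-reason steps, assuming a MaRVL/NLVR2-style instance (a statement paired with two images, judged true or false). The helpers translate_to_english, caption_image, and query_llm, and the task format itself, are placeholders for illustration rather than the authors' released pipeline; the visual-programming intervention is only indicated in a comment.

```python
# Hedged sketch of two of the three interventions (translate-test and
# caption-then-reason). All helper functions are hypothetical placeholders:
# plug in whatever MT system, image captioner, and LLM you have available.

def translate_to_english(text: str, source_lang: str) -> str:
    """Translate a non-English statement into English (translate-test)."""
    raise NotImplementedError("swap in an MT system of your choice")

def caption_image(image_path: str) -> str:
    """Produce an English caption so a text-only model can 'see' the image."""
    raise NotImplementedError("swap in a captioning model of your choice")

def query_llm(prompt: str) -> str:
    """Query an open model (e.g. LLaVA) or a proprietary one (e.g. GPT-4V)."""
    raise NotImplementedError("swap in an LLM of your choice")

def judge_statement(statement: str, lang: str, left_img: str, right_img: str) -> bool:
    """Assumed NLVR2/MaRVL-style instance: is `statement` true of the image pair?"""
    # Intervention 1 (multilinguality): translate-test maps input to English.
    if lang != "en":
        statement = translate_to_english(statement, lang)

    # Intervention 3 (multimodality): captioning turns the images into text.
    left_cap, right_cap = caption_image(left_img), caption_image(right_img)

    # Intervention 2 (complex reasoning) would decompose the statement into
    # visual-programming steps; here a single prompt stands in for it.
    prompt = (
        f"Left image: {left_cap}\n"
        f"Right image: {right_cap}\n"
        f"Statement: {statement}\n"
        "Is the statement true? Answer 'true' or 'false'."
    )
    return query_llm(prompt).strip().lower().startswith("true")
```

A call such as judge_statement(...) with real models plugged in reproduces the flow described above; the reported 13.4% LLaVA improvement comes from the authors' actual pipeline, not this sketch.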
Related papers
- SandboxAQ's submission to MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval [1.2629889435114405]
This paper explores the problems of Question Answering (QA) and Named Entity Recognition (NER) in five diverse languages.
We tested five Large Language Models with various prompting methods, including zero-shot, chain-of-thought reasoning, and translation techniques.
Our results show that while some models consistently outperform others, their effectiveness varies significantly across tasks and languages.
arXiv Detail & Related papers (2024-10-28T20:15:45Z)
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions.
Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks.
Languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
- M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks [10.677274746850554]
M5 is the first comprehensive benchmark designed to evaluate LMMs on diverse vision-language tasks within a multilingual context.
We highlight substantial task-agnostic performance disparities between high- and low-resource languages.
We show that larger models do not necessarily outperform smaller ones in a multilingual setting.
arXiv Detail & Related papers (2024-07-04T09:55:04Z)
- Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
arXiv Detail & Related papers (2023-11-02T08:37:30Z)
- Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z)
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)