Why do LLaVA Vision-Language Models Reply to Images in English?
- URL: http://arxiv.org/abs/2407.02333v1
- Date: Tue, 2 Jul 2024 15:01:55 GMT
- Title: Why do LLaVA Vision-Language Models Reply to Images in English?
- Authors: Musashi Hinck, Carolin Holtermann, Matthew Lyle Olson, Florian Schneider, Sungduk Yu, Anahita Bhiwandiwalla, Anne Lauscher, Shaoyen Tseng, Vasudev Lal
- Abstract summary: We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs).
Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query.
- Score: 15.727116803057633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query. This paper investigates the causes of this bias with a two-pronged approach that combines extensive ablation of the design space with a mechanistic analysis of the models' internal representations of image and text inputs. Both approaches indicate that the issue stems from the language modelling component of the LLaVA model. Statistically, we find that switching the language backbone for a bilingual language model has the strongest effect on reducing this error. Mechanistically, we provide compelling evidence that visual inputs are not mapped to a space similar to that of text inputs, and that intervening on intermediary attention layers can reduce this bias. Our findings provide important insights to researchers and engineers seeking to understand the crossover between multimodal and multilingual spaces, and contribute to the goal of developing capable and inclusive VLMs for non-English contexts.
Related papers
- Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense [30.62699081329474]
We introduce a novel benchmark for cross-lingual sense disambiguation, StingrayBench.
We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German.
In our analysis of various models, we observe they tend to be biased toward higher-resource languages.
arXiv Detail & Related papers (2024-10-28T22:09:43Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Could We Have Had Better Multilingual LLMs If English Was Not the Central Language? [4.655168524016426]
Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on.
Our study delves into Llama2's translation capabilities.
Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen.
arXiv Detail & Related papers (2024-02-21T16:32:38Z)
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do.
We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models.
Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z)
- ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding [1.9906814758497542]
ICU, Image Caption Understanding, divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM) takes the caption as the alt text and performs cross-lingual language understanding.
We show that ICU can achieve new state-of-the-art results for five languages, and comparable results for the rest.
arXiv Detail & Related papers (2023-10-19T07:11:48Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability [16.01088313166145]
We investigate the role of incidental bilingualism in large language models.
We show that PaLM is exposed to over 30 million translation pairs across at least 44 languages.
We show that the presence of this incidental bilingualism has a substantial impact on translation capabilities, although this impact diminishes with model scale.
arXiv Detail & Related papers (2023-05-17T14:58:06Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.