mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
- URL: http://arxiv.org/abs/2307.06930v3
- Date: Thu, 20 Jun 2024 06:48:17 GMT
- Title: mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
- Authors: Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš,
- Abstract summary: Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
- Score: 50.17767479660832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to `understand' the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like Llava 1.5. We release our model, code, and train data at \url{https://github.com/gregor-ge/mBLIP}.
Related papers
- HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [91.0552157725366]
This paper presents a novel high-performance monolithic VLM named HoVLE.
It converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts.
Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks.
arXiv Detail & Related papers (2024-12-20T18:59:59Z) - Liquid: Language Models are Scalable Multi-modal Generators [112.71734051183726]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation.
Unlike previous multimodal large language model (MLLM), Liquid achieves this integration using a single large language model.
For the first time, Liquid uncovers a scaling law that performance drop unavoidably brought by the unified training of visual and language tasks.
arXiv Detail & Related papers (2024-12-05T16:48:16Z) - mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus [52.83121058429025]
We introduce mOSCAR, the first large-scale multilingual and multimodal document corpus crawled from the web.
It covers 163 languages, 315M documents, 214B tokens and 1.2B images.
It shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks.
arXiv Detail & Related papers (2024-06-13T00:13:32Z) - An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts.
Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation.
We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z) - Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language which is spoken by over 50 million people world wide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z) - Generalizing Multimodal Pre-training into Multilingual via Language
Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.