Related papers: mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs

URL: http://arxiv.org/abs/2307.06930v3
Date: Thu, 20 Jun 2024 06:48:17 GMT
Title: mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Authors: Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš,
Abstract summary: Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to understand' the image input. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
Score: 50.17767479660832
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to `understand' the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like Llava 1.5. We release our model, code, and train data at \url{https://github.com/gregor-ge/mBLIP}.

Related papers

Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models [17.444066202370397]
Cross-lingual transfer enables vision-language models to perform vision tasks in various languages with training data only in one language. Current approaches rely on large pre-trained multilingual language models. We propose Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters combining the pre-trained VLM Florence-2 and the large language model Gemma-2.
arXiv Detail & Related papers (2025-03-12T14:41:10Z)
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [91.0552157725366]
This paper presents a novel high-performance monolithic VLM named HoVLE. It converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks.
arXiv Detail & Related papers (2024-12-20T18:59:59Z)
Liquid: Language Models are Scalable and Unified Multi-modal Generators [112.71734051183726]
Liquid is an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation. Unlike previous multimodal large language model (MLLM), Liquid achieves this integration using a single large language model. For the first time, Liquid uncovers a scaling law that performance drop unavoidably brought by the unified training of visual and language tasks.
arXiv Detail & Related papers (2024-12-05T16:48:16Z)
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions.<n>We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs.<n>Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus [52.83121058429025]
We introduce mOSCAR, the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. It shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks.
arXiv Detail & Related papers (2024-06-13T00:13:32Z)
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts. Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation. We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z)
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks. LLMs often struggle to perform well on low-resource languages because there is so little training data available. In this work, we explore training LLaMA-2 to speak Amharic, a language which is spoken by over 50 million people world wide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning [58.92843729869586]
Vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, but their mastery in a few languages like English restricts their applicability in broader communities. We propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF) We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance.
arXiv Detail & Related papers (2024-01-30T17:14:05Z)
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs) This integration promotes a more detailed comprehension of images for the MLLM. We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks [38.05496300873095]
Vision language pre-training aims to learn alignments between vision and language from a large amount of data. We propose to learn multi-grained vision language alignments by a unified pre-training framework. X$2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions.
arXiv Detail & Related papers (2022-11-22T16:48:01Z)
Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks. Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training. We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning. We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM) Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.