Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task
Instruction Tuning
- URL: http://arxiv.org/abs/2310.08166v3
- Date: Tue, 31 Oct 2023 17:51:51 GMT
- Title: Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task
Instruction Tuning
- Authors: Junyu Lu, Dixiang Zhang, Xiaojun Wu, Xinyu Gao, Ruyi Gan, Jiaxing
Zhang, Yan Song, Pingjian Zhang
- Abstract summary: We introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs).
Our models adopt the Querying Transformer (Q-Former) from BLIP-2 and further explore optimization schemes for visual-language alignment.
In addition, we exploit the understanding ability of GPT-4 in multi-modal scenarios, translating our collected English image-text datasets into Chinese.
- Score: 27.544311403607786
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advancements have expanded the capabilities of large language
models (LLMs) in zero-shot image-to-text generation and understanding by
integrating multi-modal inputs. However, such success is typically limited to
English scenarios due to the lack of large-scale, high-quality non-English
multi-modal resources, making it extremely difficult to establish competitive
counterparts in other languages. In this paper, we introduce the Ziya-Visual
series, a set of bilingual large-scale vision-language models (LVLMs) designed
to incorporate visual semantics into LLMs for multi-modal dialogue. Comprising
Ziya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying
Transformer (Q-Former) from BLIP-2 and further explore optimization schemes
such as instruction tuning, multi-stage training, and a low-rank adaptation
(LoRA) module for visual-language alignment. In addition, we exploit the
understanding ability of GPT-4 in multi-modal scenarios, translating our
collected English image-text datasets into Chinese and generating
instruction-response pairs through in-context learning. The experimental
results demonstrate that, compared to existing LVLMs, Ziya-Visual achieves
competitive performance across a wide range of English-only tasks, including
zero-shot image-text retrieval, image captioning, and visual question
answering. The evaluation leaderboard assessed by GPT-4 also indicates that our
models possess satisfactory image-text understanding and generation
capabilities in Chinese multi-modal dialogue scenarios. Code, demo and models
are available at https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1.
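The released training and inference code lives in the repository linked above; the two sketches below are only rough illustrations of what the abstract describes, not the authors' implementation.

First, a minimal sketch of the visual-language alignment recipe named in the abstract (frozen LLM, Q-Former query features, a linear projection, and LoRA adapters), built on the `transformers` and `peft` libraries. The backbone name, LoRA hyper-parameters, target module names, feature dimensions, and the `alignment_loss` helper are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder backbone; the released checkpoint packages its own vision
# tower, Q-Former and LLM weights.
llm = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
llm_dim = llm.config.hidden_size

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection names depend on the backbone
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)  # base weights frozen, only adapters train

# Project Q-Former query outputs into the LLM embedding space.
qformer_dim, num_queries = 768, 32           # illustrative sizes
visual_proj = nn.Linear(qformer_dim, llm_dim)

def alignment_loss(qformer_queries, input_ids, labels):
    """qformer_queries: (B, num_queries, qformer_dim) from a frozen Q-Former."""
    visual_embeds = visual_proj(qformer_queries)            # (B, Q, llm_dim)
    text_embeds = llm.get_input_embeddings()(input_ids)     # (B, T, llm_dim)
    inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
    # Visual positions are masked out of the language-modeling loss.
    vis_labels = torch.full(visual_embeds.shape[:2], -100,
                            dtype=labels.dtype, device=labels.device)
    full_labels = torch.cat([vis_labels, labels], dim=1)
    return llm(inputs_embeds=inputs_embeds, labels=full_labels).loss
```

Second, a hypothetical sketch of the GPT-4 data step: translating an English caption into Chinese and generating a Chinese instruction-response pair from in-context examples. The prompt wording, few-shot content, model name, and the `build_chinese_instruction` helper are assumptions for illustration only.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [
    {"role": "user",
     "content": "Caption: A black dog runs across a grassy field.\n"
                "Translate the caption into Chinese, then write one Chinese "
                "instruction about the image and a faithful Chinese response."},
    {"role": "assistant",
     "content": "译文: 一只黑狗跑过草地。\n指令: 描述图片中的动物在做什么。\n"
                "回答: 图中一只黑色的狗正在草地上奔跑。"},
]

def build_chinese_instruction(english_caption: str) -> str:
    """Return GPT-4's Chinese translation plus an instruction-response pair."""
    messages = FEW_SHOT + [{
        "role": "user",
        "content": f"Caption: {english_caption}\n"
                   "Translate the caption into Chinese, then write one Chinese "
                   "instruction about the image and a faithful Chinese response.",
    }]
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    return reply.choices[0].message.content
```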
Related papers
- ICU: Conquering Language Barriers in Vision-and-Language Modeling by
Dividing the Tasks into Image Captioning and Language Understanding [1.9906814758497542]
ICU, Image Caption Understanding, divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM) takes the caption as the alt text and performs cross-lingual language understanding.
We show that ICU can achieve new state-of-the-art results for five languages, and comparable results for the rest.
arXiv Detail & Related papers (2023-10-19T07:11:48Z) - VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS).
It combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
arXiv Detail & Related papers (2023-10-15T07:58:52Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Position-Enhanced Visual Instruction Tuning for Multimodal Large
Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs) by integrating an additional region-level vision encoder.
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z) - Bootstrapping Vision-Language Learning with Decoupled Language
Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z) - PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emergent capabilities, such as complex counting and multilingual object detection, which are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z) - Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks and from multimodal tasks to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z) - Generalizing Multimodal Pre-training into Multilingual via Language
Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been made to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model to multilingual settings.
arXiv Detail & Related papers (2022-05-29T08:53:22Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.