Qwen-VL: A Versatile Vision-Language Model for Understanding,
Localization, Text Reading, and Beyond
- URL: http://arxiv.org/abs/2308.12966v3
- Date: Fri, 13 Oct 2023 02:41:28 GMT
- Authors: Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng
Wang, Junyang Lin, Chang Zhou, Jingren Zhou
- Abstract summary: We introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs).
Starting from Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus.
The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records among generalist models of similar scale.
- Score: 72.41822115096741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce the Qwen-VL series, a set of large-scale
vision-language models (LVLMs) designed to perceive and understand both texts
and images. Starting from the Qwen-LM as a foundation, we endow it with visual
capacity by the meticulously designed (i) visual receptor, (ii) input-output
interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal
cleaned corpus. Beyond the conventional image description and
question-answering, we implement the grounding and text-reading ability of
Qwen-VLs by aligning image-caption-box tuples. The resulting models, including
Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar
model scales on a broad range of visual-centric benchmarks (e.g., image
captioning, question answering, visual grounding) and different settings (e.g.,
zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our
instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to
existing vision-language chatbots. Code, demo and models are available at
https://github.com/QwenLM/Qwen-VL.
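As the abstract notes, the code, demo, and models are public. The sketch below shows one way to query the released Qwen-VL-Chat checkpoint for image description and visual grounding, following the usage pattern in the linked repository; the from_list_format and chat helpers and the box-token convention come from the checkpoint's custom code loaded with trust_remote_code=True, and the image path is a placeholder.

```python
# Hedged sketch: querying Qwen-VL-Chat for captioning and grounding.
# The helper methods below (from_list_format, chat) are provided by the
# checkpoint's custom code, loaded via trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Image description: interleave an image reference and a text prompt.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},            # placeholder path or URL
    {"text": "Describe this image."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# Grounding: the model answers with <ref>phrase</ref><box>(x1,y1),(x2,y2)</box>
# tokens, with coordinates normalized to a 0-1000 grid.
response, history = model.chat(tokenizer, "Output the bounding box of the dog.", history=history)
print(response)
```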
Related papers
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing.
Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens (see the sketch after this list).
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z)
- CogVLM: Visual Expert for Pretrained Language Models [56.69978233342978]
We introduce CogVLM, a powerful open-source visual language foundation model.
CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC.
arXiv Detail & Related papers (2023-11-06T13:04:39Z)
- MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge [35.45809761628721]
Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities.
We propose an unsupervised approach to fine-tuning such models on unlabeled video data for strong zero-shot action recognition performance.
Our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks.
arXiv Detail & Related papers (2023-03-15T20:17:41Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling [50.370767959977506]
UNICORN is a vision-language model that unifies text generation and bounding box prediction into a single architecture.
We formulate all vision-language problems as a generation task, where the target sequence consists of the integrated text and box tokens.
With such a unified framework and input-output format, UNICORN achieves comparable performance to task-specific state of the art on 7 VL benchmarks.
arXiv Detail & Related papers (2021-11-23T18:59:14Z)
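To make the Naive Dynamic Resolution idea from the Qwen2-VL entry above concrete, here is a minimal sketch of how the number of visual tokens can scale with input resolution. The 14-pixel patch size and 2x2 token merging are assumptions taken from the Qwen2-VL report, and the rounding policy is illustrative rather than the model's exact preprocessing.

```python
import math

def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Illustrative visual-token count under dynamic resolution (assumed 14px patches, 2x2 merge)."""
    unit = patch * merge                 # each merged token covers a 28x28 pixel region
    h = math.ceil(height / unit) * unit  # round sides up to a multiple of 28
    w = math.ceil(width / unit) * unit
    return (h // unit) * (w // unit)

# A 224x224 image maps to 64 tokens; a 1120x784 image maps to 1120 tokens.
print(visual_token_count(224, 224), visual_token_count(1120, 784))
```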