CogVLM: Visual Expert for Pretrained Language Models
- URL: http://arxiv.org/abs/2311.03079v2
- Date: Sun, 4 Feb 2024 08:23:04 GMT
- Title: CogVLM: Visual Expert for Pretrained Language Models
- Authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang,
Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi
Li, Yuxiao Dong, Ming Ding, Jie Tang
- Abstract summary: We introduce CogVLM, a powerful open-source visual language foundation model.
CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.
- Score: 56.69978233342978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce CogVLM, a powerful open-source visual language foundation model.
Different from the popular shallow alignment method, which maps image features
into the input space of the language model, CogVLM bridges the gap between the
frozen pretrained language model and the image encoder with a trainable visual
expert module in the attention and FFN layers. As a result, CogVLM enables deep
fusion of vision-language features without sacrificing any performance on NLP tasks.
CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal
benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+,
RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on
VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X
55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.
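As a concrete illustration of the visual expert described above, here is a minimal PyTorch-style sketch, not the official CogVLM implementation: image-token positions are routed through their own trainable QKV projections, while text-token positions keep the frozen language-model projections; CogVLM applies the same splitting to the FFN as well. The class name, parameter names, and the `vision_mask` convention are illustrative assumptions.

```python
# Minimal sketch of a "visual expert" attention layer (assumes PyTorch >= 2.0).
import torch
import torch.nn as nn


class VisualExpertAttention(nn.Module):
    """Attention block with a trainable visual expert alongside frozen LM weights."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Frozen, pretrained language-model projections (used for text tokens).
        self.text_qkv = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.text_out = nn.Linear(hidden_size, hidden_size, bias=False)
        for p in list(self.text_qkv.parameters()) + list(self.text_out.parameters()):
            p.requires_grad = False
        # Trainable visual-expert projections (used for image tokens).
        self.vision_qkv = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.vision_out = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, vision_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); vision_mask: (batch, seq) bool, True for image tokens.
        b, s, h = hidden_states.shape
        mask = vision_mask.unsqueeze(-1)
        # Route each position through the matching projection.
        # (Computed densely and selected with torch.where for clarity only; an
        # efficient implementation would gather image and text positions separately.)
        qkv = torch.where(mask, self.vision_qkv(hidden_states), self.text_qkv(hidden_states))
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        attn = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, h)
        return torch.where(mask, self.vision_out(attn), self.text_out(attn))
```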
Related papers
- CogVLM2: Visual Language Models for Image and Video Understanding [69.361109860391]
We propose the CogVLM2 family, a new generation of visual language models for image and video understanding.
As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages.
As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction.
arXiv Detail & Related papers (2024-08-29T12:59:12Z)
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [72.41822115096741]
We introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs).
We endow it with visual capacity through (i) a meticulously designed visual receptor, (ii) an input-output interface, (iii) a 3-stage training pipeline, and (iv) a multilingual multimodal cleaned corpus.
The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales.
arXiv Detail & Related papers (2023-08-24T17:59:17Z)
- Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training.
Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features.
Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework which transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones (a minimal sketch of this idea follows after this list).
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
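The caption bootstrapping step mentioned in the BLIP entry above can be pictured with the following hypothetical sketch. `generate_caption` and `matches_image` are assumed interfaces standing in for BLIP's captioner and filter, not its actual API; the loop simply keeps whichever web or synthetic captions the filter judges to match the image.

```python
# Hypothetical sketch of caption bootstrapping (captioner + filter), not BLIP's real API.
from typing import Callable, Iterable, List, Tuple


def bootstrap_captions(
    web_pairs: Iterable[Tuple[object, str]],        # (image, noisy web caption)
    generate_caption: Callable[[object], str],      # assumed captioner interface
    matches_image: Callable[[object, str], bool],   # assumed filter interface
) -> List[Tuple[object, str]]:
    """Return a cleaned image-caption dataset for further pre-training."""
    cleaned = []
    for image, web_caption in web_pairs:
        synthetic_caption = generate_caption(image)  # captioner writes a synthetic caption
        # filter keeps only captions judged to match the image
        for caption in (web_caption, synthetic_caption):
            if matches_image(image, caption):
                cleaned.append((image, caption))
    return cleaned
```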
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.