COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training
- URL: http://arxiv.org/abs/2401.00849v1
- Date: Mon, 1 Jan 2024 18:58:42 GMT
- Title: COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training
- Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin
Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
- Abstract summary: Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce a contrastive loss into text generation models, partitioning the language model into a dedicated unimodal text-processing component and a multimodal data-handling component.
Because high-quality long-text video datasets remain scarce, this work also introduces Howto-Interlink7M, an inaugural interleaved video-text dataset featuring comprehensive captions.
- Score: 119.03392147066093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text
comprehension to encompassing extended textual contexts is pivotal. Recent
autoregressive vision-language models such as Flamingo and PaLM-E, leveraging
the long-context capability of Large Language Models, have excelled in few-shot
text generation tasks but face challenges in alignment tasks. Addressing this
gap, we introduce a contrastive loss into text generation models, presenting
the COntrastive-Streamlined MultimOdal framework (COSMO), which strategically
partitions the language model into a dedicated unimodal text-processing
component and a multimodal data-handling component. COSMO, our unified
framework, merges unimodal and multimodal elements, improving performance on
tasks involving textual and visual data while notably reducing the number of
learnable parameters.
However, these models demand extensive long-text datasets, yet the availability
of high-quality long-text video datasets remains limited. To bridge this gap,
this work introduces Howto-Interlink7M, an inaugural interleaved video-text
dataset featuring comprehensive captions, marking a significant step forward.
Demonstrating its impact, we illustrate how Howto-Interlink7M enhances model
performance on image-text tasks. With 34% of the learnable parameters and
72% of the available data, our model demonstrates significant superiority over
OpenFlamingo. For instance, on the 4-shot Flickr captioning task, performance
notably improves from 57.2% to 65.1%. The contributions of COSMO and
Howto-Interlink7M are underscored by notable performance gains across 14
diverse downstream datasets encompassing both image-text and video-text tasks.
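To make the training objective concrete: COSMO couples the standard autoregressive captioning loss with an image-text contrastive loss drawn from the unimodal part of the partitioned language model. The following is a minimal sketch of such a joint objective, assuming precomputed pooled features and logits; the function name, the temperature, and the weight `lambda_con` are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def cosmo_style_joint_loss(text_features, image_features, lm_logits, target_ids,
                           temperature=0.07, lambda_con=1.0, pad_id=0):
    """Hypothetical COSMO-style objective: autoregressive language-modeling
    loss plus a CLIP-style image-text contrastive loss.

    text_features:  (B, D) pooled output of the unimodal text layers
    image_features: (B, D) pooled output of the vision encoder
    lm_logits:      (B, T, V) next-token logits from the multimodal layers
    target_ids:     (B, T) caption tokens, already shifted; pad_id marks padding
    """
    # Contrastive term: matched image/text pairs are positives within the batch.
    text_features = F.normalize(text_features, dim=-1)
    image_features = F.normalize(image_features, dim=-1)
    sim = image_features @ text_features.t() / temperature        # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_con = 0.5 * (F.cross_entropy(sim, targets)
                      + F.cross_entropy(sim.t(), targets))

    # Generative term: standard next-token cross-entropy on the caption.
    loss_lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                              target_ids.reshape(-1), ignore_index=pad_id)

    return loss_lm + lambda_con * loss_con
```

In the paper's partitioning, the lower language-model layers would supply `text_features` for the contrastive term while the upper, multimodal layers produce `lm_logits`; here both are taken as inputs to keep the sketch self-contained.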
Related papers
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
We also present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z)
- Harmonizing Visual Text Comprehension and Generation [31.605599298507293]
We present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text.
We propose Slide-LoRA, which aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space.
Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-07-23T10:11:56Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus (a toy sketch of this concatenation appears after this list).
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
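As referenced in the COSA entry above, the sample-concatenation idea can be shown in a few lines: independent image-text pairs are grouped and joined into one pseudo video-paragraph sample. This is a toy sketch under assumed data structures (the `Pair` and `Sample` classes and the group size of 4 are inventions for illustration), not COSA's actual pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Pair:
    image: object     # e.g. a decoded image tensor; the type is an assumption
    caption: str

@dataclass
class Sample:
    frames: list      # images reinterpreted as an ordered pseudo video
    paragraph: str    # captions joined into one long-form description

def concatenate_samples(pairs, group_size=4, seed=0):
    """Group independent image-text pairs into pseudo video-paragraph
    samples by sequential concatenation (COSA-style, sketched)."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    samples = []
    for i in range(0, len(shuffled) - group_size + 1, group_size):
        group = shuffled[i:i + group_size]
        samples.append(Sample(frames=[p.image for p in group],
                              paragraph=" ".join(p.caption for p in group)))
    return samples

# Usage: 8 toy pairs become 2 pseudo "video-paragraph" samples.
pairs = [Pair(image=None, caption=f"caption {i}") for i in range(8)]
print(len(concatenate_samples(pairs)))  # -> 2
```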