2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
- URL: http://arxiv.org/abs/2501.00958v3
- Date: Mon, 27 Jan 2025 18:17:26 GMT
- Title: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
- Authors: Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
- Abstract summary: We introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining.
It collects over 2.5 years of instructional videos, totaling 22,000 class hours.
Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment.
- Score: 86.76706820098867
- License:
- Abstract: Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, such existing datasets are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
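The corpus construction described in the abstract (keyframe extraction, ASR, OCR, and interleaving by temporal order) can be illustrated with a rough sketch. This is not the authors' released pipeline (see the GitHub link above); it assumes OpenCV, openai-whisper, and pytesseract as stand-in tools, and the video filename and sampling interval are placeholders.

```python
# Minimal sketch: turn one instructional video into an interleaved
# image-text sequence (keyframes + OCR + ASR, ordered by timestamp).
# Tool choices are assumptions, not the paper's actual pipeline.
import cv2                 # pip install opencv-python
import whisper             # pip install openai-whisper (needs ffmpeg)
import pytesseract         # pip install pytesseract (needs Tesseract binary)
from PIL import Image

def extract_keyframes(video_path, every_n_seconds=5.0):
    """Sample one frame every few seconds; a real pipeline would detect
    keyframes by comparing consecutive frames, but uniform sampling keeps
    the sketch short."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = int(fps * every_n_seconds)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            timestamp = idx / fps
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append((timestamp, Image.fromarray(rgb)))
        idx += 1
    cap.release()
    return frames

def build_interleaved_sample(video_path):
    """Merge keyframes (with OCR text) and ASR segments into one
    temporally ordered image-text sequence."""
    events = []
    for ts, image in extract_keyframes(video_path):
        ocr_text = pytesseract.image_to_string(image).strip()
        events.append((ts, "image", {"image": image, "ocr": ocr_text}))
    asr = whisper.load_model("base").transcribe(video_path)
    for seg in asr["segments"]:
        events.append((seg["start"], "text", seg["text"].strip()))
    events.sort(key=lambda e: e[0])   # interleave by temporal order
    return events

if __name__ == "__main__":
    # "geometry_lecture.mp4" is a placeholder filename.
    for ts, kind, _payload in build_interleaved_sample("geometry_lecture.mp4"):
        print(f"{ts:7.1f}s  {kind}")
```

In practice, a corpus-scale version of this would also filter low-quality frames and transcripts and refine the OCR/ASR text, as the abstract notes, before writing out the interleaved samples.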
Related papers
- Improving Fine-grained Visual Understanding in VLMs through Text-Only Training [0.0]
We investigate the feasibility of enhancing fine-grained visual understanding in Vision-Language Models (VLMs) through text-only training.
We conduct comprehensive experiments on two distinct domains: fine-grained species classification and cultural visual understanding tasks.
Our findings demonstrate that text-only training can be comparable to conventional image-text training while significantly reducing computational costs.
arXiv Detail & Related papers (2024-12-17T14:18:50Z) - VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we prompt VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - HawkEye: Training Video-Text LLMs for Grounding Text in Videos [44.870165050047355]
We propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner.
To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans.
We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives.
arXiv Detail & Related papers (2024-03-15T11:58:18Z) - Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties [13.938281516499119]
We implement Emergent In-context Learning on Videos (EILEV), a novel training paradigm that induces in-context learning over video and text.
Our results, analysis, and EILEV-trained models yield numerous insights about the emergence of in-context learning over video and text.
arXiv Detail & Related papers (2023-11-28T18:53:06Z) - Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video Transcript for ASR (TVTS), which sorts scripts by attending to learned video representations.
These advantages enable our model to contextualize what is happening as humans do and to apply seamlessly to large-scale uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions. A generic sketch of such a contrastive objective follows this list.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
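To make the contrastive pre-training mentioned in the VICTOR entry above concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive objective over paired video and text embeddings. It is not taken from the VICTOR paper (whose encoders and losses may differ); the embedding dimension, batch size, and temperature are placeholder values.

```python
# Minimal sketch of a generic symmetric contrastive (InfoNCE-style)
# objective for paired video-text embeddings; illustrative only.
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings of matched video-text pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
    # Penalize both video-to-text and text-to-video retrieval directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

if __name__ == "__main__":
    loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(loss.item())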