Long-range Multimodal Pretraining for Movie Understanding
- URL: http://arxiv.org/abs/2308.09775v1
- Date: Fri, 18 Aug 2023 18:52:59 GMT
- Title: Long-range Multimodal Pretraining for Movie Understanding
- Authors: Dawit Mureja Argaw, Joon-Young Lee, Markus Woodson, In So Kweon,
Fabian Caba Heilbron
- Abstract summary: We introduce Long-range Multimodal Pretraining, a strategy and a model that leverages movie data to train transferable multimodal and cross-modal encoders.
Our key idea is to learn from all modalities in a movie by observing and extracting relationships over a long range.
Our model achieves state-of-the-art results on several LVU tasks while being much more data-efficient than previous works.
- Score: 79.63187251571391
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning computer vision models from (and for) movies has a long-standing
history. While great progress has been attained, there is still a need for a
pretrained multimodal model that can perform well in the ever-growing set of
movie understanding tasks the community has been establishing. In this work, we
introduce Long-range Multimodal Pretraining, a strategy and a model that
leverages movie data to train transferable multimodal and cross-modal encoders.
Our key idea is to learn from all modalities in a movie by observing and
extracting relationships over a long range. After pretraining, we run ablation
studies on the LVU benchmark and validate our modeling choices and the
importance of learning from long-range time spans. Our model achieves
state-of-the-art results on several LVU tasks while being much more data
efficient than previous works. Finally, we evaluate our model's transferability
by setting a new state-of-the-art on five different benchmarks.
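To make the abstract's key idea more concrete, the sketch below shows one way a model could relate many modalities over a long time span: pre-extracted per-shot features from several streams are flattened into one long token sequence, and a transformer is trained to reconstruct randomly masked shots. The modality count, feature shapes, and masked-reconstruction objective are illustrative assumptions, not the architecture or objective proposed in the paper.

```python
# Illustrative sketch only: a transformer over a long sequence of per-shot
# embeddings from several modalities, trained to reconstruct randomly masked
# shots. All names, shapes, and the objective are assumptions for illustration.
import torch
import torch.nn as nn


class LongRangeMultimodalEncoder(nn.Module):
    def __init__(self, dim=256, n_modalities=3, n_layers=4, n_heads=8, max_shots=512):
        super().__init__()
        # One projection per modality stream (e.g., video, audio, dialogue features).
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_modalities)])
        self.modality_emb = nn.Embedding(n_modalities, dim)
        self.pos_emb = nn.Embedding(max_shots, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, feats, mask):
        # feats: (batch, n_modalities, n_shots, dim) pre-extracted shot features
        # mask:  (batch, n_modalities, n_shots) bool, True where a shot is hidden
        b, m, t, d = feats.shape
        tokens = torch.stack([self.proj[i](feats[:, i]) for i in range(m)], dim=1)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        tokens = tokens + self.modality_emb.weight[None, :, None, :]
        tokens = tokens + self.pos_emb.weight[None, None, :t, :]
        # Flatten (modality, shot) into one long sequence so attention can relate
        # any modality at any time step to any other modality at any other time step.
        out = self.encoder(tokens.reshape(b, m * t, d))
        return self.head(out).reshape(b, m, t, d)


# Toy training step: reconstruct the original features at the masked positions.
model = LongRangeMultimodalEncoder()
feats = torch.randn(2, 3, 64, 256)      # 2 movies, 3 modalities, 64 shots each
mask = torch.rand(2, 3, 64) < 0.25      # hide 25% of the shot tokens
pred = model(feats, mask)
loss = (pred - feats)[mask].pow(2).mean()
loss.backward()
```

In this toy setup, attention over the flattened (modality, shot) sequence is what lets a token from one modality late in the movie attend to a token from another modality much earlier, which is the long-range, cross-modal behavior the abstract refers to.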
Related papers
- Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy [111.1291107651131]
Long-VITA is a large multi-modal model for long-context visual-language understanding tasks.
It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens.
Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing.
arXiv Detail & Related papers (2025-02-07T18:59:56Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show that a single model can be trained to solve at least 3x more tasks/modalities than existing models, without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
- Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models [17.25135606956287]
The Competitive Multi-modal Distillation framework (CoMD) captures bidirectional feedback between teacher and student models.
Our experimental analysis of diverse datasets shows that our knowledge transfer method consistently improves the capabilities of the student model.
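Reading "bidirectional feedback" loosely as mutual distillation, the sketch below nudges the student toward the teacher's predictions and then the teacher toward the student's; the losses, temperature, and update order are assumptions for illustration, not the actual CoMD procedure.

```python
# Illustrative sketch only: "bidirectional feedback" read as mutual distillation,
# where the student is pulled toward the teacher's output distribution and the
# teacher is then pulled toward the student's. Not the actual CoMD procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F


def mutual_distillation_step(teacher, student, batch, opt_t, opt_s, tau=2.0):
    """One update in which each model is nudged toward the other's predictions."""
    x, y = batch

    # Student update: task loss + KL toward the (detached) teacher distribution.
    logits_t = teacher(x)
    logits_s = student(x)
    loss_s = F.cross_entropy(logits_s, y) + tau ** 2 * F.kl_div(
        F.log_softmax(logits_s / tau, dim=-1),
        F.softmax(logits_t.detach() / tau, dim=-1),
        reduction="batchmean",
    )
    opt_s.zero_grad()
    loss_s.backward()
    opt_s.step()

    # Teacher ("feedback") update: task loss + KL toward the updated student.
    logits_t = teacher(x)
    loss_t = F.cross_entropy(logits_t, y) + tau ** 2 * F.kl_div(
        F.log_softmax(logits_t / tau, dim=-1),
        F.softmax(student(x).detach() / tau, dim=-1),
        reduction="batchmean",
    )
    opt_t.zero_grad()
    loss_t.backward()
    opt_t.step()
    return loss_s.item(), loss_t.item()


# Toy usage with stand-in linear classifiers.
teacher, student = nn.Linear(16, 4), nn.Linear(16, 4)
opt_t = torch.optim.SGD(teacher.parameters(), lr=0.1)
opt_s = torch.optim.SGD(student.parameters(), lr=0.1)
batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
print(mutual_distillation_step(teacher, student, batch, opt_t, opt_s))
```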
arXiv Detail & Related papers (2023-11-14T14:49:46Z)
- Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn increasing attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep pre-training work in natural language processing, computer vision, and speech.
It then introduces the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discusses MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
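The cascaded-selection idea can be pictured with the rough sketch below: segments are scored against a query, the top few are kept, regions inside them are scored and pruned, and attention runs only over the survivors. The scoring scheme, top-k values, and tensor layout are illustrative assumptions, not MIST's actual modules.

```python
# Illustrative sketch only: cascaded selection instead of dense spatio-temporal
# attention. Segments are scored against a query, the top few are kept, regions
# inside them are scored and pruned, and attention runs only over the survivors.
# The scoring scheme, top-k values, and tensor layout are assumptions.
import torch
import torch.nn as nn


class CascadedSelector(nn.Module):
    def __init__(self, dim=256, n_heads=8, top_segments=4, top_regions=8):
        super().__init__()
        self.top_segments = top_segments
        self.top_regions = top_regions
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, video, query):
        # video: (batch, n_segments, n_frames, n_regions, dim) region features
        # query: (batch, dim), e.g. a pooled question embedding
        b, s, f, r, d = video.shape
        q = query[:, None, :]                                    # (b, 1, d)

        # 1) Segment selection: keep the segments most similar to the query.
        seg_scores = (video.mean(dim=(2, 3)) * q).sum(-1)        # (b, s)
        seg_idx = seg_scores.topk(self.top_segments, dim=1).indices
        kept = torch.gather(
            video, 1, seg_idx[:, :, None, None, None].expand(-1, -1, f, r, d)
        )                                                        # (b, k_s, f, r, d)

        # 2) Region selection inside the kept segments.
        reg_scores = (kept.mean(dim=2) * q[:, None]).sum(-1)     # (b, k_s, r)
        reg_idx = reg_scores.topk(self.top_regions, dim=2).indices
        kept = torch.gather(
            kept, 3, reg_idx[:, :, None, :, None].expand(-1, -1, f, -1, d)
        )                                                        # (b, k_s, f, k_r, d)

        # 3) Attention only over the surviving tokens.
        tokens = kept.reshape(b, -1, d)
        out, _ = self.attn(q, tokens, tokens)
        return out.squeeze(1)                                    # (b, d)


# Toy usage: 16 segments x 4 frames x 16 regions per video.
selector = CascadedSelector()
video = torch.randn(2, 16, 4, 16, 256)
query = torch.randn(2, 256)
print(selector(video, query).shape)  # torch.Size([2, 256])
```

The point of the cascade is cost: attention runs over only the kept segments and regions rather than over every (segment, frame, region) token at once.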
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- VindLU: A Recipe for Effective Video-and-Language Pretraining [83.49216853881595]
This paper conducts an empirical study demystifying the most important factors in the VidL model design.
Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining.
Our model, trained using this recipe, achieves results comparable to or better than the state of the art on several VidL tasks.
arXiv Detail & Related papers (2022-12-09T18:54:05Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks and achieve new state-of-the-art performance.
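As a rough illustration of a multimodal temporal contrastive objective of this kind, the sketch below applies a symmetric InfoNCE loss between pooled long-form video embeddings and their paired paragraph embeddings; the pooling, temperature, and batch construction are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: a symmetric InfoNCE loss between pooled long-form
# video embeddings and their paired paragraph embeddings. The pooling, the
# temperature, and the batch construction are assumptions for illustration.
import torch
import torch.nn.functional as F


def video_paragraph_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim), one row per (video, paragraph) pair
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs sit on the diagonal; penalize both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# Toy usage with random stand-in embeddings.
video_emb = torch.randn(4, 256)   # e.g. pooled over many clips of a long video
text_emb = torch.randn(4, 256)    # e.g. pooled over the paragraph's sentences
print(video_paragraph_contrastive_loss(video_emb, text_emb))
```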
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.