Exploring Efficient Foundational Multi-modal Models for Video Summarization
- URL: http://arxiv.org/abs/2410.07405v1
- Date: Wed, 9 Oct 2024 20:07:06 GMT
- Title: Exploring Efficient Foundational Multi-modal Models for Video Summarization
- Authors: Karan Samel, Apoorva Beedu, Nitish Sontakke, Irfan Essa,
- Abstract summary: Such video foundation models perform pre-training by aligning outputs from each modality-specific model into the same embedding space.
We propose a plug-and-play video language model, using texts generated from each input modality into the language model.
We compare the performance versus the computational costs for our plug-and-play style method and baseline tuning methods.
- Score: 15.418001616659808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundational models are able to generate text outputs given prompt instructions and text, audio, or image inputs. Recently these models have been combined to perform tasks on video, such as video summarization. Such video foundation models perform pre-training by aligning outputs from each modality-specific model into the same embedding space. Then the embeddings from each model are used within a language model, which is fine-tuned on a desired instruction set. Aligning each modality during pre-training is computationally expensive and prevents rapid testing of different base modality models. During fine-tuning, evaluation is carried out within in-domain videos where it is hard to understand the generalizability and data efficiency of these methods. To alleviate these issues we propose a plug-and-play video language model. It directly uses the texts generated from each input modality into the language model, avoiding pre-training alignment overhead. Instead of fine-tuning we leverage few-shot instruction adaptation strategies. We compare the performance versus the computational costs for our plug-and-play style method and baseline tuning methods. Finally, we explore the generalizability of each method during domain shift and present insights on what data is useful when training data is limited. Through this analysis, we present practical insights on how to leverage multi-modal foundational models for effective results given realistic compute and data limitations.
Related papers
- Efficient Transfer Learning for Video-language Foundation Models [13.166348605993292]
We propose a simple yet effective Multi-modal Spatio-supervised (MSTA) to improve the alignment between representations in the text and vision branches.
We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-valiant, and fully-language learning.
arXiv Detail & Related papers (2024-11-18T01:25:58Z) - Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z) - Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods are prone to be domain-specific, thus being costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z) - Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple by randomly dropping input video patches and masking out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performances, which are comparable to some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z) - Adapting Large Language Models for Content Moderation: Pitfalls in Data
Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Actuarial Applications of Natural Language Processing Using
Transformers: Case Studies for Using Text Features in an Actuarial Context [0.0]
This tutorial demonstrates to incorporate text data into actuarial classification and regression tasks.
The main focus is on methods employing transformer-based models.
The case studies tackle challenges related to a multi-lingual setting and long input sequences.
arXiv Detail & Related papers (2022-06-04T15:39:30Z) - Mutual Modality Learning for Video Action Classification [74.83718206963579]
We show how to embed multi-modality into a single model for video action classification.
We achieve state-of-the-art results in the Something-Something-v2 benchmark.
arXiv Detail & Related papers (2020-11-04T21:20:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.