Language as the Medium: Multimodal Video Classification through text only
- URL: http://arxiv.org/abs/2309.10783v1
- Date: Tue, 19 Sep 2023 17:32:21 GMT
- Title: Language as the Medium: Multimodal Video Classification through text only
- Authors: Laura Hanu, Anita L. Verő, James Thewlis
- Abstract summary: We propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information.
Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2.
Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks.
- Score: 3.744589644319257
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite an exciting new wave of multimodal machine learning models, current
approaches still struggle to interpret the complex contextual relationships
between the different modalities present in videos. Going beyond existing
methods that emphasize simple activities or objects, we propose a new
model-agnostic approach for generating detailed textual descriptions that
capture multimodal video information. Our method leverages the extensive
knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason
about textual descriptions of the visual and aural modalities, obtained from
BLIP-2, Whisper and ImageBind. Without needing additional finetuning of
video-text models or datasets, we demonstrate that available LLMs have the
ability to use these multimodal textual descriptions as proxies for "sight"
or "hearing" and perform zero-shot multimodal classification of videos
in-context. Our evaluations on popular action recognition benchmarks, such as
UCF-101 or Kinetics, show these context-rich descriptions can be successfully
used in video understanding tasks. This method points towards a promising new
research direction in multimodal classification, demonstrating how an interplay
between textual, visual and auditory machine learning models can enable more
holistic video understanding.
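The pipeline the abstract describes can be sketched in a few lines: per-modality models (BLIP-2 for frame captions, Whisper for speech, ImageBind for audio tags) each emit text, which is assembled into a single prompt for an LLM to classify zero-shot. The sketch below is a minimal, self-contained illustration of that prompt-assembly and answer-parsing step only; the function names, prompt wording, and canned LLM reply are assumptions, not the paper's actual implementation, and the real system would send the prompt to GPT-3.5 or Llama 2.

```python
# Hypothetical sketch of "language as the medium": per-modality text
# descriptions are fused into one prompt, and an LLM picks the label.

def build_prompt(frame_captions, transcript, audio_tags, labels):
    """Assemble a zero-shot classification prompt from per-modality
    textual descriptions (the LLM's proxy for 'sight' and 'hearing')."""
    caption_block = "\n".join(f"- {c}" for c in frame_captions)
    return (
        "You are classifying a video from text descriptions only.\n"
        f"Visual frame captions (e.g. from BLIP-2):\n{caption_block}\n"
        f"Speech transcript (e.g. from Whisper): {transcript}\n"
        f"Audio tags (e.g. from ImageBind): {', '.join(audio_tags)}\n"
        f"Answer with exactly one label from: {', '.join(labels)}."
    )

def parse_label(llm_output, labels):
    """Map a free-form LLM reply back to one candidate label."""
    reply = llm_output.lower()
    for label in labels:
        if label.lower() in reply:
            return label
    return None  # no label recognised; caller may retry or fall back

labels = ["playing guitar", "surfing", "typing"]
prompt = build_prompt(
    ["a man holds an acoustic guitar", "close-up of strumming fingers"],
    "here's a little tune I wrote last week",
    ["acoustic guitar", "singing"],
    labels,
)
# In the real system `prompt` would go to an LLM; a canned reply
# stands in here to keep the sketch runnable offline.
print(parse_label("The video shows someone playing guitar.", labels))
```

Because classification happens entirely in text, any captioner, transcriber, or LLM can be swapped in without retraining, which is what makes the approach model-agnostic.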
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- OVMR: Open-Vocabulary Recognition with Multi-Modal References [96.21248144937627]
Existing works have proposed different methods to embed category cues into the model, eg, through few-shot fine-tuning.
This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images.
The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet.
arXiv Detail & Related papers (2024-06-07T06:45:28Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of research to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis [1.6181085766811525]
We propose a multimodal structure to obtain state-of-the-art performance in this field.
We select several benchmarks for multimodal video understanding and apply the most suitable model to find the best performance.
arXiv Detail & Related papers (2021-05-28T08:39:19Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.