A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot
- URL: http://arxiv.org/abs/2305.09758v3
- Date: Thu, 26 Oct 2023 10:08:31 GMT
- Title: A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot
- Authors: Aanisha Bhattacharya, Yaman K Singla, Balaji Krishnamurthy, Rajiv Ratn
Shah, Changyou Chen
- Abstract summary: We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
- Score: 67.00455874279383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimedia content, such as advertisements and story videos, exhibit a rich
blend of creativity and multiple modalities. They incorporate elements like
text, visuals, audio, and storytelling techniques, employing devices like
emotions, symbolism, and slogans to convey meaning. There is a dearth of large
annotated training datasets in the multimedia domain hindering the development
of supervised learning models with satisfactory performance for real-world
applications. On the other hand, the rise of large language models (LLMs) has
witnessed remarkable zero-shot performance in various natural language
processing (NLP) tasks, such as emotion classification, question-answering, and
topic classification. To leverage such advanced techniques to bridge this
performance gap in multimedia understanding, we propose verbalizing long videos
to generate their descriptions in natural language, followed by performing
video-understanding tasks on the generated story as opposed to the original
video. Through extensive experiments on fifteen video-understanding tasks, we
demonstrate that our method, despite being zero-shot, achieves significantly
better results than supervised baselines for video understanding. Furthermore,
to alleviate a lack of story understanding benchmarks, we publicly release the
first dataset on a crucial task in computational social science on persuasion
strategy identification.
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
textbfVidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL)
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z) - Understanding Chinese Video and Language via Contrastive Multimodal
Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.