HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
- URL: http://arxiv.org/abs/2310.04900v1
- Date: Sat, 7 Oct 2023 19:32:55 GMT
- Title: HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
- Authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt
Schiele, Hilde Kuehne
- Abstract summary: We propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
Our evaluation shows that the resulting captions not only significantly improve performance on many different benchmark datasets for text-video retrieval but also disentangle textual narration from the audio, boosting performance on text-video-audio tasks.
- Score: 77.02631712558251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instructional videos are an excellent source for learning multimodal
representations by leveraging video-subtitle pairs extracted with automatic
speech recognition systems (ASR) from the audio signal in the videos. However,
in contrast to human-annotated captions, both speech and subtitles naturally
differ from the visual content of the videos and thus provide only noisy
supervision for multimodal learning. As a result, large-scale annotation-free
web video training data remains sub-optimal for training text-video models. In
this work, we propose to leverage the capability of large language models
(LLMs) to obtain fine-grained video descriptions aligned with videos.
Specifically, we prompt an LLM to create plausible video descriptions based on
ASR narrations of the video for a large-scale instructional video dataset. To
this end, we introduce a prompting method that is able to take into account a
longer text of subtitles, allowing us to capture context beyond a single
sentence. To align the captions to the video temporally, we prompt the LLM to
generate timestamps for each produced caption based on the subtitles. In this
way, we obtain human-style video captions at scale without human supervision.
We apply our method to the subtitles of the HowTo100M dataset, creating a new
large-scale dataset, HowToCaption. Our evaluation shows that the resulting
captions not only significantly improve the performance over many different
benchmark datasets for text-video retrieval but also lead to a disentangling of
textual narration from the audio, boosting performance in text-video-audio
tasks.
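The core recipe described in the abstract, prompting an LLM with a longer window of timestamped ASR subtitles and asking it to return captions with timestamps, can be illustrated with a minimal sketch. The prompt wording, the output format, and the llm callable below are illustrative assumptions, not the authors' released pipeline.
```python
# Minimal sketch (not the authors' code): turn a window of timestamped ASR
# subtitles into timestamped captions by prompting a text-in/text-out LLM.
import re
from typing import Callable, List, Tuple

Subtitle = Tuple[float, str]   # (start time in seconds, ASR sentence)
Caption = Tuple[float, str]    # (timestamp in seconds, generated caption)

def build_prompt(subtitles: List[Subtitle]) -> str:
    """Pack a longer block of subtitles into one prompt so the LLM can use
    context beyond a single sentence."""
    lines = "\n".join(f"[{int(t)}s] {text}" for t, text in subtitles)
    return (
        "Below are timestamped speech subtitles from an instructional video.\n"
        "Rewrite them as short captions describing what is visibly happening,\n"
        "and copy a timestamp from the relevant subtitle for each caption.\n"
        "Answer with one '[<seconds>s] <caption>' line per caption.\n\n"
        + lines
    )

def parse_captions(llm_output: str) -> List[Caption]:
    """Parse '[12s] caption text' lines from the LLM response."""
    return [(float(m.group(1)), m.group(2).strip())
            for m in re.finditer(r"\[(\d+)s\]\s*(.+)", llm_output)]

def caption_block(subtitles: List[Subtitle],
                  llm: Callable[[str], str]) -> List[Caption]:
    """`llm` is any text-completion call; it is deliberately left abstract."""
    return parse_captions(llm(build_prompt(subtitles)))
```
Applied over consecutive subtitle blocks of each video, the parsed timestamps give the temporal alignment needed to pair each generated caption with a video clip.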
Related papers
- Scalable and Accurate Self-supervised Multimodal Representation Learning
without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level (see the sketch after this list).
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) method.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
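As referenced in the HierVL entry above, a contrastive objective applied at both the clip level and the video level can be sketched as two symmetric InfoNCE terms; the mean-pooling aggregation, temperature, and tensor layout below are assumptions for illustration, not HierVL's actual implementation.
```python
# Minimal sketch of a two-level (clip + video) contrastive objective; the
# pooling and shapes are illustrative assumptions, not HierVL's code.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two aligned batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_emb: torch.Tensor, text_emb: torch.Tensor,
                      clips_per_video: int) -> torch.Tensor:
    """clip_emb/text_emb: (videos * clips_per_video, dim), clips grouped per video.
    The clip-level term aligns clips with their sentences; the video-level term
    aligns mean-pooled clip and text embeddings of each video."""
    clip_loss = info_nce(clip_emb, text_emb)
    video_v = clip_emb.view(-1, clips_per_video, clip_emb.size(-1)).mean(dim=1)
    video_t = text_emb.view(-1, clips_per_video, text_emb.size(-1)).mean(dim=1)
    return clip_loss + info_nce(video_v, video_t)
```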