HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
- URL: http://arxiv.org/abs/2310.04900v1
- Date: Sat, 7 Oct 2023 19:32:55 GMT
- Title: HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
- Authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt
Schiele, Hilde Kuehne
- Abstract summary: We propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
Our evaluation shows that the resulting captions not only significantly improve performance on many different benchmark datasets for text-video retrieval but also disentangle textual narration from the audio, boosting performance on text-video-audio tasks.
- Score: 77.02631712558251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instructional videos are an excellent source for learning multimodal
representations by leveraging video-subtitle pairs extracted with automatic
speech recognition systems (ASR) from the audio signal in the videos. However,
in contrast to human-annotated captions, both speech and subtitles naturally
differ from the visual content of the videos and thus provide only noisy
supervision for multimodal learning. As a result, large-scale annotation-free
web video training data remains sub-optimal for training text-video models. In
this work, we propose to leverage the capability of large language models
(LLMs) to obtain fine-grained video descriptions aligned with videos.
Specifically, we prompt an LLM to create plausible video descriptions based on
ASR narrations of the video for a large-scale instructional video dataset. To
this end, we introduce a prompting method that is able to take into account a
longer text of subtitles, allowing us to capture context beyond a single
sentence. To align the captions to the video temporally, we prompt the LLM to
generate timestamps for each produced caption based on the subtitles. In this
way, we obtain human-style video captions at scale without human supervision.
We apply our method to the subtitles of the HowTo100M dataset, creating a new
large-scale dataset, HowToCaption. Our evaluation shows that the resulting
captions not only significantly improve the performance over many different
benchmark datasets for text-video retrieval but also lead to a disentangling of
textual narration from the audio, boosting performance in text-video-audio
tasks.
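The core recipe described in the abstract, prompting an LLM with a longer window of timestamped ASR subtitles and asking it to return captions with timestamps, can be illustrated with a minimal sketch. The prompt wording, the output format, and the llm callable below are illustrative assumptions, not the authors' released pipeline.
```python
# Minimal sketch (not the authors' code): turn a window of timestamped ASR
# subtitles into timestamped captions by prompting a text-in/text-out LLM.
import re
from typing import Callable, List, Tuple

Subtitle = Tuple[float, str]   # (start time in seconds, ASR sentence)
Caption = Tuple[float, str]    # (timestamp in seconds, generated caption)

def build_prompt(subtitles: List[Subtitle]) -> str:
    """Pack a longer block of subtitles into one prompt so the LLM can use
    context beyond a single sentence."""
    lines = "\n".join(f"[{int(t)}s] {text}" for t, text in subtitles)
    return (
        "Below are timestamped speech subtitles from an instructional video.\n"
        "Rewrite them as short captions describing what is visibly happening,\n"
        "and copy a timestamp from the relevant subtitle for each caption.\n"
        "Answer with one '[<seconds>s] <caption>' line per caption.\n\n"
        + lines
    )

def parse_captions(llm_output: str) -> List[Caption]:
    """Parse '[12s] caption text' lines from the LLM response."""
    return [(float(m.group(1)), m.group(2).strip())
            for m in re.finditer(r"\[(\d+)s\]\s*(.+)", llm_output)]

def caption_block(subtitles: List[Subtitle],
                  llm: Callable[[str], str]) -> List[Caption]:
    """`llm` is any text-completion call; it is deliberately left abstract."""
    return parse_captions(llm(build_prompt(subtitles)))
```
Applied over consecutive subtitle blocks of each video, the parsed timestamps give the temporal alignment needed to pair each generated caption with a video clip.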
Related papers
- Scalable and Accurate Self-supervised Multimodal Representation Learning
without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level (see the sketch after this list).
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves state-of-the-art results.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) method.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
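As referenced in the HierVL entry above, a contrastive objective applied at both the clip level and the video level can be sketched as two symmetric InfoNCE terms; the mean-pooling aggregation, temperature, and tensor layout below are assumptions for illustration, not HierVL's actual implementation.
```python
# Minimal sketch of a two-level (clip + video) contrastive objective; the
# pooling and shapes are illustrative assumptions, not HierVL's code.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two aligned batches of embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_emb: torch.Tensor, text_emb: torch.Tensor,
                      clips_per_video: int) -> torch.Tensor:
    """clip_emb/text_emb: (videos * clips_per_video, dim), clips grouped per video.
    The clip-level term aligns clips with their sentences; the video-level term
    aligns mean-pooled clip and text embeddings of each video."""
    clip_loss = info_nce(clip_emb, text_emb)
    video_v = clip_emb.view(-1, clips_per_video, clip_emb.size(-1)).mean(dim=1)
    video_t = text_emb.view(-1, clips_per_video, text_emb.size(-1)).mean(dim=1)
    return clip_loss + info_nce(video_v, video_t)
```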