Multi-modal Dense Video Captioning
- URL: http://arxiv.org/abs/2003.07758v2
- Date: Tue, 5 May 2020 18:12:10 GMT
- Title: Multi-modal Dense Video Captioning
- Authors: Vladimir Iashin and Esa Rahtu
- Abstract summary: We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
- Score: 18.592384822257948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense video captioning is the task of localizing interesting events
in an untrimmed video and producing a textual description (caption) for each
localized event. Most previous works in dense video captioning are based solely
on visual information and completely ignore the audio track. However, audio,
and speech in particular, are vital cues for a human observer in understanding
an environment. In this paper, we present a new dense video captioning approach
that is able to utilize any number of modalities for event description.
Specifically, we show how the audio and speech modalities can improve a dense
video captioning model. We apply an automatic speech recognition (ASR) system
to obtain a temporally aligned textual description of the speech (similar to
subtitles) and treat it as a separate input alongside the video frames and the
corresponding audio track. We formulate the captioning task as a machine
translation problem and utilize the recently proposed Transformer architecture
to convert multi-modal input data into textual descriptions. We demonstrate the
performance of our model on the ActivityNet Captions dataset. The ablation
studies indicate a considerable contribution from the audio and speech
components, suggesting that these modalities contain substantial information
complementary to the video frames. Furthermore, we provide an in-depth analysis
of the ActivityNet Captions results by leveraging the category tags obtained
from the original YouTube videos. Code is publicly available:
github.com/v-iashin/MDVC
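To make the described setup concrete, below is a minimal sketch (in PyTorch) of how per-modality inputs, pre-extracted video and audio features plus ASR transcript tokens, could be encoded separately and fused for caption generation. It is an illustration only, not the authors' MDVC implementation: the feature dimensions, the use of torch.nn Transformer modules, the concatenation-based fusion, and the omission of positional encodings are all simplifying assumptions.

```python
# Minimal sketch of a multi-modal captioner, assuming pre-extracted features.
import torch
import torch.nn as nn

class MultiModalCaptioner(nn.Module):
    """Toy model: one Transformer encoder per modality, one shared decoder."""
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=2,
                 video_dim=1024, audio_dim=128):
        super().__init__()
        # Project pre-extracted video/audio features into the model dimension.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.speech_emb = nn.Embedding(vocab_size, d_model)  # ASR transcript tokens
        self.word_emb = nn.Embedding(vocab_size, d_model)    # caption tokens
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # One encoder per modality (the layer is deep-copied inside each encoder).
        self.video_enc = nn.TransformerEncoder(enc_layer, num_layers)
        self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers)
        self.speech_enc = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, audio_feats, speech_tokens, caption_tokens):
        # Encode each modality independently (positional encodings omitted for brevity).
        v = self.video_enc(self.video_proj(video_feats))      # (B, Tv, d_model)
        a = self.audio_enc(self.audio_proj(audio_feats))      # (B, Ta, d_model)
        s = self.speech_enc(self.speech_emb(speech_tokens))   # (B, Ts, d_model)
        # Fuse by concatenating along the time axis (a deliberate simplification).
        memory = torch.cat([v, a, s], dim=1)
        tgt = self.word_emb(caption_tokens)                   # (B, Tc, d_model)
        T = tgt.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(h)                                    # next-token logits

# Shape check with random inputs: 2 clips, 32 video steps, 48 audio steps,
# 20 transcript tokens and 15 caption tokens each.
model = MultiModalCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 32, 1024), torch.randn(2, 48, 128),
               torch.randint(0, 10000, (2, 20)), torch.randint(0, 10000, (2, 15)))
print(logits.shape)  # torch.Size([2, 15, 10000])
```

In the paper's setting, the speech tokens would come from the ASR transcript aligned to the event proposal; the plain concatenation fusion here is only one of several ways a decoder could attend to multiple modalities.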
Related papers
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate the text, guided by a pre-trained audio-language model, to produce captions (a rough sketch of this guidance idea is given after the related-papers list).
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take longer subtitle text into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) approach.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
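For the ZerAuCap entry above, here is a rough, hypothetical sketch of what audio-language-model guidance can look like: an LLM proposes candidate captions, and a CLAP-style audio-text model re-scores them against the audio clip. The callables llm_propose_captions, embed_audio, and embed_text are placeholders rather than the API of any specific library, and the scoring rule is an assumption for illustration; the actual method may differ, e.g. by guiding decoding step by step instead of re-ranking complete captions.

```python
# Hypothetical sketch of re-ranking LLM caption candidates with an
# audio-language model; all three callables are placeholders.
import torch
import torch.nn.functional as F

def guided_caption(audio_waveform, llm_propose_captions, embed_audio, embed_text,
                   lm_weight=0.3):
    """Pick the LLM-proposed caption that best agrees with the audio.

    Assumed placeholder interfaces:
      llm_propose_captions() -> (list of candidate captions, their LM log-probs)
      embed_audio(waveform)  -> audio embedding from an audio-language model, shape (D,)
      embed_text(captions)   -> matching text embeddings, shape (N, D)
    """
    candidates, lm_logprobs = llm_propose_captions()               # list[str], (N,)
    audio_emb = F.normalize(embed_audio(audio_waveform), dim=-1)   # (D,)
    text_emb = F.normalize(embed_text(candidates), dim=-1)         # (N, D)
    audio_match = text_emb @ audio_emb                             # cosine similarities, (N,)
    # Trade off audio-text agreement against the LLM's own fluency preference.
    scores = audio_match + lm_weight * lm_logprobs
    return candidates[int(scores.argmax())]
```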
The list of related papers above is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.