Fine-grained Audible Video Description
- URL: http://arxiv.org/abs/2303.15616v1
- Date: Mon, 27 Mar 2023 22:03:48 GMT
- Title: Fine-grained Audible Video Description
- Authors: Xuyang Shen, Dong Li, Jinxing Zhou, Zhen Qin, Bowen He, Xiaodong Han,
  Aixuan Li, Yuchao Dai, Lingpeng Kong, Meng Wang, Yu Qiao, Yiran Zhong
- Abstract summary: We construct the first fine-grained audible video description benchmark (FAVDBench).
For each video clip, we first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end.
We demonstrate that employing fine-grained video descriptions enables video generation models to create more intricate videos than captions do.
- Score: 61.81122862375985
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We explore a new task for audio-visual-language modeling called fine-grained
audible video description (FAVD). It aims to provide detailed textual
descriptions for the given audible videos, including the appearance and spatial
locations of each object, the actions of moving objects, and the sounds in
videos. Existing visual-language modeling tasks often concentrate on visual
cues in videos while undervaluing the language and audio modalities. On the
other hand, FAVD requires not only audio-visual-language modeling skills but
also paragraph-level language generation abilities. We construct the first
fine-grained audible video description benchmark (FAVDBench) to facilitate this
research. For each video clip, we first provide a one-sentence summary of the
video, i.e., the caption, followed by 4-6 sentences describing the visual details
and 1-2 audio-related descriptions at the end. The descriptions are provided in
both English and Chinese. We create two new metrics for this task: an
EntityScore to gauge the completeness of entities in the visual descriptions,
and an AudioScore to assess the audio descriptions. As a preliminary approach
to this task, we propose an audio-visual-language transformer that extends
an existing video captioning model with an additional audio branch. We combine the
masked language modeling and auto-regressive language modeling losses to
optimize our model so that it can produce paragraph-level descriptions. We
illustrate the efficacy of our model in audio-visual-language modeling by
evaluating it against the proposed benchmark using both conventional captioning
metrics and our proposed metrics. We further put our benchmark to the test with
video generation models, demonstrating that conditioning on fine-grained video
descriptions yields more intricate videos than conditioning on captions alone.
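To make the annotation format described in the abstract concrete (a one-sentence caption, 4-6 visual sentences, and 1-2 audio sentences, each provided in English and Chinese), a single FAVDBench record could be organized roughly as sketched below. The field names and the container class are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FAVDRecord:
    """Illustrative container for one FAVDBench clip annotation.
    Field names are assumptions; consult the official release for the real schema."""
    clip_id: str
    caption_en: str                                                   # one-sentence summary (English)
    caption_zh: str                                                   # one-sentence summary (Chinese)
    visual_descriptions_en: List[str] = field(default_factory=list)  # 4-6 sentences on visual details
    visual_descriptions_zh: List[str] = field(default_factory=list)
    audio_descriptions_en: List[str] = field(default_factory=list)   # 1-2 audio-related sentences
    audio_descriptions_zh: List[str] = field(default_factory=list)

    def full_description(self, lang: str = "en") -> str:
        """Concatenate caption, visual, and audio sentences into one paragraph-level description."""
        if lang == "en":
            parts = [self.caption_en, *self.visual_descriptions_en, *self.audio_descriptions_en]
        else:
            parts = [self.caption_zh, *self.visual_descriptions_zh, *self.audio_descriptions_zh]
        return " ".join(parts)
```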
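The abstract does not define how EntityScore is computed. A minimal sketch of the underlying idea, measuring how completely the entities mentioned in the reference visual descriptions are covered by the generated paragraph, might look as follows; the word-level entity proxy and the recall formulation are assumptions, and AudioScore is not sketched here.

```python
import re
from typing import Iterable, Set


def extract_entities(sentences: Iterable[str]) -> Set[str]:
    """Crude entity proxy: lowercase alphabetic tokens longer than two characters.
    A real implementation would use a noun-phrase chunker or an NER model."""
    words: Set[str] = set()
    for s in sentences:
        words.update(w for w in re.findall(r"[a-zA-Z]+", s.lower()) if len(w) > 2)
    return words


def entity_score(reference_sents: Iterable[str], generated_text: str) -> float:
    """Recall-style completeness: fraction of reference entities that also appear
    in the generated description (an assumption about EntityScore's intent)."""
    ref_entities = extract_entities(reference_sents)
    gen_entities = extract_entities([generated_text])
    if not ref_entities:
        return 1.0
    return len(ref_entities & gen_entities) / len(ref_entities)
```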
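The abstract states that the model combines masked language modeling and auto-regressive language modeling losses to produce paragraph-level descriptions. A hedged PyTorch-style sketch of such a combination is given below; the equal default weighting, the tensor interface, and the label conventions are assumptions rather than the paper's actual training setup.

```python
import torch
import torch.nn.functional as F


def combined_lm_loss(mlm_logits: torch.Tensor, mlm_labels: torch.Tensor,
                     ar_logits: torch.Tensor, ar_labels: torch.Tensor,
                     mlm_weight: float = 1.0, ar_weight: float = 1.0,
                     ignore_index: int = -100) -> torch.Tensor:
    """Combine a masked-LM loss and an auto-regressive (next-token) loss.
    Logits are (batch, seq_len, vocab); labels are (batch, seq_len)."""
    # Masked language modeling: only masked positions carry real labels,
    # the rest are set to ignore_index so they do not contribute to the loss.
    mlm_loss = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=ignore_index,
    )
    # Auto-regressive modeling: shift by one so each position predicts the next token.
    ar_loss = F.cross_entropy(
        ar_logits[:, :-1].reshape(-1, ar_logits.size(-1)),
        ar_labels[:, 1:].reshape(-1),
        ignore_index=ignore_index,
    )
    return mlm_weight * mlm_loss + ar_weight * ar_loss
```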
Related papers
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) approach.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions and object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)