Video-Guided Curriculum Learning for Spoken Video Grounding
- URL: http://arxiv.org/abs/2209.00277v1
- Date: Thu, 1 Sep 2022 07:47:01 GMT
- Title: Video-Guided Curriculum Learning for Spoken Video Grounding
- Authors: Yan Xia, Zhou Zhao, Shangwei Ye, Yang Zhao, Haoyuan Li, Yi Ren
- Abstract summary: We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy for the audio pre-training process.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
- Score: 65.49979202728167
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions. Compared with using text, employing audio requires the model to directly exploit the useful phonemes and syllables related to the video from raw speech. Moreover, we randomly add environmental noise to the speech audio, further increasing the difficulty of the task and better simulating real applications. To rectify the discriminative phonemes and extract video-related information from the noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy for the audio pre-training process, which makes use of vital visual perceptions to help understand the spoken language and suppress the external noise. Since the model cannot access ground-truth video segments at inference, we design a curriculum strategy that gradually shifts the input video from the ground-truth segment to the entire video content during pre-training. As a result, the model learns to extract critical visual information from the entire video clip to help understand the spoken language. In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet, named the ActivityNet Speech dataset. Extensive experiments demonstrate that the proposed video-guided curriculum learning facilitates the pre-training process, yielding a mutual audio encoder that significantly improves performance on spoken video grounding. Moreover, we show that with noisy audio, our model outperforms a method that grounds videos with ASR transcripts, further demonstrating the effectiveness of our curriculum strategy.
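The curriculum described above can be pictured as a simple schedule over how much of the video the audio encoder is allowed to see during pre-training. The snippet below is a minimal, hypothetical sketch, not the authors' released code: curriculum_window linearly widens the visible window from the ground-truth segment to the whole clip as a progress value goes from 0 to 1, and add_environmental_noise shows one common way to mix a noise clip into speech at a chosen SNR. The paper does not specify its exact schedule or noise recipe, so the function names and formulas here are illustrative assumptions.

```python
import numpy as np


def curriculum_window(gt_start: float, gt_end: float, video_len: float, progress: float):
    """Visible video window at a given curriculum stage.

    progress = 0.0 -> only the ground-truth segment is shown;
    progress = 1.0 -> the entire video is shown.
    (Illustrative linear schedule; the paper's exact scheme may differ.)
    """
    start = gt_start * (1.0 - progress)             # left boundary expands towards 0
    end = gt_end + (video_len - gt_end) * progress  # right boundary expands towards video_len
    return start, end


def add_environmental_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix an environmental-noise clip into speech at a target SNR (dB).

    Standard SNR-based mixing; the paper only states that noise is
    added randomly, not the precise procedure.
    """
    noise = np.resize(noise, speech.shape)          # tile or trim noise to the speech length
    speech_power = np.mean(speech ** 2) + 1e-8
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```

For example, calling curriculum_window(12.0, 18.0, 60.0, 0.5) would expose roughly the 6 s to 39 s window, halfway between the ground-truth segment and the full 60 s clip.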
Related papers
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)