What You Say Is What You Show: Visual Narration Detection in
Instructional Videos
- URL: http://arxiv.org/abs/2301.02307v2
- Date: Tue, 18 Jul 2023 17:29:16 GMT
- Title: What You Say Is What You Show: Visual Narration Detection in
Instructional Videos
- Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
- Abstract summary: We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS^2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
- Score: 108.77600799637172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Narrated "how-to" videos have emerged as a promising data source for a wide
range of learning problems, from learning visual representations to training
robot policies. However, this data is extremely noisy, as the narrations do not
always describe the actions demonstrated in the video. To address this problem,
we introduce the novel task of visual narration detection, which entails
determining whether a narration is visually depicted by the actions in the
video. We propose What You Say is What You Show (WYS^2), a method that
leverages multi-modal cues and pseudo-labeling to learn to detect visual
narrations with only weakly labeled data. Our model successfully detects visual
narrations in in-the-wild videos, outperforming strong baselines, and we
demonstrate its impact for state-of-the-art summarization and temporal
alignment of instructional videos.
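The abstract describes the approach only at a high level: multi-modal cues plus pseudo-labeling over weakly labeled narrated video. The sketch below is a minimal illustration of that general recipe, not the authors' WYS^2 implementation. It assumes precomputed clip and narration embeddings (e.g., from a pretrained video-text model), scores each narration-clip pair by cosine similarity, turns a teacher model's noisy scores into binary pseudo-labels, and trains a small detector on them; all names, dimensions, and thresholds are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class NarrationDetector(nn.Module):
    """Scores whether a narration is visually depicted in a video clip."""

    def __init__(self, video_dim=512, text_dim=512, hidden=256):
        super().__init__()
        # Hypothetical projection heads on top of frozen, precomputed
        # video/text features (dimensions are assumptions).
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        # Cosine similarity as the "is this narration shown?" score.
        return (v * t).sum(dim=-1)


def pseudo_labels(teacher_scores, pos_thresh=0.6, neg_thresh=0.2):
    """Convert noisy teacher scores into weak binary labels.

    Pairs whose score falls between the two thresholds get label -1
    and are ignored during training.
    """
    labels = torch.full_like(teacher_scores, -1.0)
    labels[teacher_scores >= pos_thresh] = 1.0  # narration is depicted
    labels[teacher_scores <= neg_thresh] = 0.0  # narration is not depicted
    return labels


def train_step(model, optimizer, video_feat, text_feat, teacher_scores):
    labels = pseudo_labels(teacher_scores)
    keep = labels >= 0
    if keep.sum() == 0:
        return None
    # Scale similarities before the sigmoid (an assumed temperature of 10).
    logits = model(video_feat[keep], text_feat[keep]) * 10.0
    loss = F.binary_cross_entropy_with_logits(logits, labels[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with random tensors standing in for real embeddings.
model = NarrationDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
video_feat = torch.randn(32, 512)   # one row per video clip
text_feat = torch.randn(32, 512)    # one row per narration sentence
teacher_scores = torch.rand(32)     # e.g., from a pretrained video-text model
print(train_step(model, optimizer, video_feat, text_feat, teacher_scores))

In this toy setup, pairs whose teacher score lands between the two thresholds are simply ignored, which is one common way of keeping unreliable narrations from polluting the pseudo-labels; the actual WYS^2 training procedure may differ.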
Related papers
- Knowledge-enhanced Multi-perspective Video Representation Learning for
Scene Recognition [33.800842679024164]
We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes for videos only from visual or textual information in a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
arXiv Detail & Related papers (2024-01-09T04:37:10Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of individual features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)