Weakly-Supervised Action Detection Guided by Audio Narration
- URL: http://arxiv.org/abs/2205.05895v1
- Date: Thu, 12 May 2022 06:33:24 GMT
- Title: Weakly-Supervised Action Detection Guided by Audio Narration
- Authors: Keren Ye and Adriana Kovashka
- Abstract summary: We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
- Score: 50.4318060593995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos are better-organized, curated data sources for visual concept
learning than images. Unlike 2-dimensional images, which involve only spatial
information, videos have an additional temporal dimension that bridges and
synchronizes multiple modalities. However, in most video detection benchmarks, these
additional modalities are not fully utilized. For example, EPIC Kitchens is the
largest dataset in first-person (egocentric) vision, yet it still relies on
crowdsourced information to refine the action boundaries to provide
instance-level action annotations.
We explore how to eliminate the expensive annotations in video detection
data that provide refined boundaries. We propose a model to learn from the
narration supervision and utilize multimodal features, including RGB, motion
flow, and ambient sound. Our model learns to attend to the frames related to
the narration label while suppressing the irrelevant frames from being used.
Our experiments show that noisy audio narration suffices to learn a good action
detection model, thus reducing annotation expenses.
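As a rough illustration of the approach described in the abstract, below is a minimal sketch (not the authors' published architecture) of a narration-supervised detector that fuses per-frame RGB, motion-flow, and ambient-sound features and learns an attention weight per frame, so that frames unrelated to the narrated action are suppressed during pooling. The feature dimensions, concatenation-based fusion, single-layer attention head, and class count are assumptions chosen for illustration only.

```python
# Minimal sketch of narration-guided, weakly-supervised action detection.
# All dimensions and the fusion/attention design are illustrative assumptions,
# not the exact model from the paper.
import torch
import torch.nn as nn

class NarrationGuidedDetector(nn.Module):
    def __init__(self, rgb_dim=1024, flow_dim=1024, audio_dim=128,
                 hidden_dim=512, num_classes=97):
        super().__init__()
        fused_dim = rgb_dim + flow_dim + audio_dim
        self.embed = nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU())
        self.attention = nn.Linear(hidden_dim, 1)   # per-frame relevance score
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, rgb, flow, audio):
        # rgb/flow/audio: (batch, num_frames, feature_dim) per-frame features
        x = self.embed(torch.cat([rgb, flow, audio], dim=-1))
        # Attention over the temporal axis down-weights frames that are
        # irrelevant to the narrated action label.
        alpha = torch.softmax(self.attention(x), dim=1)   # (B, T, 1)
        clip_feat = (alpha * x).sum(dim=1)                # attention-pooled clip feature
        return self.classifier(clip_feat), alpha.squeeze(-1)

# Training uses only clip-level labels parsed from (noisy) audio narrations;
# the learned attention weights give frame-level localization at test time.
model = NarrationGuidedDetector()
logits, frame_attention = model(torch.randn(2, 32, 1024),
                                torch.randn(2, 32, 1024),
                                torch.randn(2, 32, 128))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 7]))
```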
Related papers
- Few-shot Action Recognition via Intra- and Inter-Video Information Maximization [28.31541961943443]
We propose a novel framework, Video Information Maximization (VIM), for few-shot action recognition.
VIM is equipped with an adaptive spatial-temporal video sampler and a temporal action alignment model.
VIM acts to maximize the distinctiveness of video information from limited video data.
arXiv Detail & Related papers (2023-05-10T13:05:43Z)
- What You Say Is What You Show: Visual Narration Detection in Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z)
- Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition.
We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels.
We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Where and When: Space-Time Attention for Audio-Visual Explanations [42.093794819606444]
We propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time.
Our model is capable of predicting the audio-visual video events, while justifying its decision by localizing where the relevant visual cues appear.
arXiv Detail & Related papers (2021-05-04T14:16:55Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry most information, and including audiovisual features improves over visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)