Related papers: Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

URL: http://arxiv.org/abs/2110.10596v1
Date: Wed, 20 Oct 2021 14:45:13 GMT
Title: Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell
Abstract summary: We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
Score: 78.34818195786846
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.

Related papers

CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method(CM-PIE), which can learn fine-grained features by applying a segment-based attention module. We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features. S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks. We tackle the problem of learning textitactor-centered representations through the notion of continual hierarchical predictive learning. Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
MaAST: Map Attention with Semantic Transformersfor Efficient Visual Navigation [4.127128889779478]
This work focuses on performing better or comparable to the existing learning-based solutions for visual navigation for autonomous agents. We propose a method to encode vital scene semantics into a semantically informed, top-down egocentric map representation. We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-03-21T12:01:23Z)
Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild. Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation. Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network. We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos. Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos. Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query. Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. We present an effective weakly-supervised model, named as Multi-Level Attentional Reconstruction Network (MARN), which only relies on video-sentence pairs during the training stage.
arXiv Detail & Related papers (2020-03-16T07:01:01Z)
See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks [184.4379622593225]
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
arXiv Detail & Related papers (2020-01-19T11:10:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.