Parallel Attention Network with Sequence Matching for Video Grounding
- URL: http://arxiv.org/abs/2105.08481v1
- Date: Tue, 18 May 2021 12:43:20 GMT
- Title: Parallel Attention Network with Sequence Matching for Video Grounding
- Authors: Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, Rick
Siow Mong Goh
- Abstract summary: Given a video, video grounding aims to retrieve a temporal moment that semantically corresponds to a language query.
We propose a Parallel Attention Network with Sequence matching (SeqPAN) to address the challenges in this task.
- Score: 56.649826885121264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a video, video grounding aims to retrieve a temporal moment that
semantically corresponds to a language query. In this work, we propose a
Parallel Attention Network with Sequence matching (SeqPAN) to address the
challenges in this task: multi-modal representation learning, and target moment
boundary prediction. We design a self-guided parallel attention module to
effectively capture self-modal contexts and cross-modal attentive information
between video and text. Inspired by sequence labeling tasks in natural language
processing, we split the ground truth moment into begin, inside, and end
regions. We then propose a sequence matching strategy to guide start/end
boundary predictions using region labels. Experimental results on three
datasets show that SeqPAN is superior to state-of-the-art methods. Furthermore,
the effectiveness of the self-guided parallel attention module and the sequence
matching module is verified.
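The abstract's central idea of recasting boundary prediction as a sequence-labeling problem can be illustrated with a small sketch. The snippet below is not the authors' implementation; it only shows, under assumed details (clip-level indexing and a hypothetical `ratio` parameter controlling how much of the moment counts as its begin/end regions), how a ground-truth moment could be split into begin, inside, and end region labels.

```python
# Minimal sketch (not the SeqPAN code): label video clips as outside ('O'),
# begin ('B'), inside ('I'), or end ('E'), mirroring the abstract's idea of
# splitting the ground-truth moment into begin, inside, and end regions.
# The fraction of the moment assigned to each boundary region ("ratio")
# is a hypothetical parameter, not taken from the paper.

def region_labels(num_clips: int, start: int, end: int, ratio: float = 0.3):
    """Return one label per clip index in [0, num_clips)."""
    labels = ["O"] * num_clips
    span = max(end - start + 1, 1)
    edge = max(int(round(span * ratio)), 1)  # clips per boundary region
    for i in range(start, end + 1):
        if i < start + edge:
            labels[i] = "B"
        elif i > end - edge:
            labels[i] = "E"
        else:
            labels[i] = "I"
    return labels

# Example: a 12-clip video whose target moment spans clips 3..8
print(region_labels(12, 3, 8))
# ['O', 'O', 'O', 'B', 'B', 'I', 'I', 'E', 'E', 'O', 'O', 'O']
```

In a setup like this, the region labels would supervise boundary prediction alongside the usual start/end targets; the exact labeling scheme and region sizes used by SeqPAN are described in the paper itself.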
Related papers
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works tackle this task either in a fully supervised setting, which requires a large amount of manual annotation, or in a weakly supervised setting, which cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
- Looking for the Signs: Identifying Isolated Sign Instances in Continuous Video Footage [45.29710323525548]
We propose a transformer-based network, called SignLookup, to extract spatio-temporal representations from video clips.
Our model achieves state-of-the-art performance on the sign spotting task with accuracy as high as 96% on challenging benchmark datasets.
arXiv Detail & Related papers (2021-07-21T12:49:44Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)