Context-aware Biaffine Localizing Network for Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2103.11555v1
- Date: Mon, 22 Mar 2021 03:13:05 GMT
- Title: Context-aware Biaffine Localizing Network for Temporal Sentence Grounding
- Authors: Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, Yulai Xie
- Abstract summary: This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
- Score: 61.18824806906945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of temporal sentence grounding (TSG), which
aims to identify the temporal boundary of a specific segment from an untrimmed
video by a sentence query. Previous works either compare pre-defined candidate
segments with the query and select the best one by ranking, or directly regress
the boundary timestamps of the target segment. In this paper, we propose a
novel localization framework that scores all pairs of start and end indices
within the video simultaneously with a biaffine mechanism. In particular, we
present a Context-aware Biaffine Localizing Network (CBLN) which incorporates
both local and global contexts into features of each start/end position for
biaffine-based localization. The local contexts from adjacent frames help
distinguish visually similar appearances, while the global contexts from the
entire video support reasoning about temporal relations. In addition, we
develop a multi-modal self-attention module to provide a fine-grained,
query-guided video representation for this biaffine strategy. Extensive
experiments show that our CBLN significantly outperforms state-of-the-art
methods on three public datasets (ActivityNet Captions, TACoS, and
Charades-STA), demonstrating the effectiveness of the proposed localization
framework.
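
To make the biaffine scoring idea concrete, the sketch below shows how a biaffine layer can score every (start, end) index pair of a video in a single pass. It is a minimal illustration under assumed tensor shapes, with hypothetical names (BiaffineScorer, start_mlp, end_mlp), and is not the authors' released CBLN code.

```python
# Minimal sketch of biaffine (start, end) pair scoring, assuming PyTorch.
# Module and variable names here are illustrative, not the paper's actual code.
import torch
import torch.nn as nn


class BiaffineScorer(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        # Separate projections for start-role and end-role features.
        self.start_mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.end_mlp = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        # Biaffine weight; the extra +1 dimension appends a bias feature,
        # so the bilinear term also covers the linear terms.
        self.W = nn.Parameter(torch.randn(hid_dim + 1, hid_dim + 1) * 0.01)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, in_dim) query-guided video features.
        s = self.start_mlp(frame_feats)                      # (B, T, hid_dim)
        e = self.end_mlp(frame_feats)                        # (B, T, hid_dim)
        ones = frame_feats.new_ones(s.size(0), s.size(1), 1)
        s = torch.cat([s, ones], dim=-1)                     # append bias feature
        e = torch.cat([e, ones], dim=-1)
        # score[b, i, j] = s_i^T W e_j for every start index i and end index j.
        scores = torch.einsum('bid,de,bje->bij', s, self.W, e)  # (B, T, T)
        # Only pairs with start <= end describe valid segments.
        valid = torch.triu(torch.ones(s.size(1), s.size(1),
                                      dtype=torch.bool, device=scores.device))
        return scores.masked_fill(~valid, float('-inf'))


# Usage: the highest-scoring valid pair is the predicted segment boundary.
scorer = BiaffineScorer(in_dim=512, hid_dim=256)
scores = scorer(torch.randn(2, 64, 512))        # 2 videos, 64 clips each
best = scores.flatten(1).argmax(dim=1)
start_idx, end_idx = best // scores.size(2), best % scores.size(2)
```

At inference, the highest-scoring valid pair (start <= end) gives the predicted segment, which reflects the abstract's description of scoring all start/end pairs simultaneously rather than ranking pre-defined candidates or regressing boundary timestamps.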
Related papers
- Disentangle and denoise: Tackling context misalignment for video moment retrieval [16.939535169282262]
Video Moment Retrieval aims to locate in-context video moments according to a natural language query.
This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval.
arXiv Detail & Related papers (2024-08-14T15:00:27Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules are proposed within CSTNet, which exploit the spatial and temporal long-range context interdependencies of such features as well as spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)