Local-Global Video-Text Interactions for Temporal Grounding
- URL: http://arxiv.org/abs/2004.07514v1
- Date: Thu, 16 Apr 2020 08:10:41 GMT
- Title: Local-Global Video-Text Interactions for Temporal Grounding
- Authors: Jonghwan Mun, Minsu Cho, Bohyung Han
- Abstract summary: This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
- Score: 77.5114709695216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of text-to-video temporal grounding, which
aims to identify the time interval in a video semantically relevant to a text
query. We tackle this problem using a novel regression-based model that learns
to extract a collection of mid-level features for semantic phrases in a text
query, which correspond to important semantic entities described in the query
(e.g., actors, objects, and actions), and to reflect bi-modal interactions
between the linguistic features of the query and the visual features of the
video at multiple levels. The proposed method effectively predicts the target
time interval by exploiting contextual information from local to global during
bi-modal interactions. Through in-depth ablation studies, we find that
incorporating both local and global context in video-text interactions is
crucial for accurate grounding. Our experiments show that the proposed method
outperforms the state of the art on the Charades-STA and ActivityNet Captions
datasets by large margins, 7.44% and 4.61% points at Recall@tIoU=0.5,
respectively. Code is available at
https://github.com/JonghwanMun/LGI4temporalgrounding.
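For context, Recall@tIoU=0.5 counts a query as correct when the predicted interval overlaps the ground-truth interval with a temporal IoU of at least 0.5, and reports the fraction of correct queries. A minimal sketch of that metric follows; it is illustrative only, not code from the linked repository, and all function and variable names here are made up:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tiou(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose predicted interval reaches the tIoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: Recall@tIoU=0.5 over three queries (made-up intervals).
preds = [(2.0, 7.5), (10.0, 14.0), (1.0, 3.0)]
gts   = [(2.5, 8.0), (11.0, 16.0), (5.0, 9.0)]
print(recall_at_tiou(preds, gts, threshold=0.5))  # 2 of 3 hit -> 0.666...
```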
Related papers
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval [59.990432265734384]
Text-video retrieval is a challenging task that aims to search for relevant video content based on natural language descriptions.
Most existing methods only consider the global cross-modal similarity and overlook the local details.
In this paper, we design an efficient global-local alignment method.
We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state-of-the-art by a clear margin.
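To illustrate the global-versus-local distinction drawn above, here is a toy way to combine a pooled (global) clip/sentence similarity with token-to-frame (local) matching. This is a generic sketch, not T2VLAD's alignment module; all names and the weighting scheme are assumptions:

```python
import torch
import torch.nn.functional as F

def global_local_similarity(text_tokens, video_frames, alpha=0.5):
    """Toy global-local score: cosine similarity of mean-pooled features plus
    the averaged best local token-to-frame match. Shapes: (T, d) and (F, d)."""
    text_tokens = F.normalize(text_tokens, dim=-1)
    video_frames = F.normalize(video_frames, dim=-1)
    # Global term: compare mean-pooled sentence and clip representations.
    s_global = F.cosine_similarity(text_tokens.mean(0), video_frames.mean(0), dim=0)
    # Local term: each token is matched to its best frame, then averaged.
    s_local = (text_tokens @ video_frames.t()).max(dim=1).values.mean()
    return alpha * s_global + (1 - alpha) * s_local

text = torch.randn(8, 256)    # 8 word-token features
video = torch.randn(32, 256)  # 32 frame features
print(global_local_similarity(text, video))
```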
arXiv Detail & Related papers (2021-04-20T15:26:24Z)
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
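Biaffine scoring of all (start, end) pairs is a standard construction; the sketch below shows the general idea, not the paper's exact network, and the layer sizes and names are assumptions:

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    """Scores every (start, end) clip pair: s[i, j] = h_i^T W h_j + u^T h_i + v^T h_j."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.u = nn.Parameter(torch.zeros(dim))
        self.v = nn.Parameter(torch.zeros(dim))

    def forward(self, start_feats, end_feats):
        # start_feats, end_feats: (N, dim) boundary representations for N clips.
        bilinear = start_feats @ self.W @ end_feats.t()                    # (N, N)
        linear = (start_feats @ self.u)[:, None] + (end_feats @ self.v)[None, :]
        scores = bilinear + linear
        # Only spans with start <= end are valid; mask out the rest.
        n = scores.size(0)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool))
        return scores.masked_fill(~mask, float('-inf'))

feats = torch.randn(16, 128)           # 16 clip-level features
scorer = BiaffineSpanScorer(128)
scores = scorer(feats, feats)          # (16, 16) score map over (start, end) pairs
start, end = divmod(scores.argmax().item(), 16)
print(start, end)                      # indices of the highest-scoring span
```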
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
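As a rough illustration of language-conditioned message passing over video nodes (not DORi's actual object/activity graph; the module, shapes, and update rule are assumptions):

```python
import torch
import torch.nn as nn

class LanguageConditionedMessagePassing(nn.Module):
    """One round of message passing where edge weights depend on the query."""
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes, query):
        # nodes: (N, dim) segment/object features; query: (dim,) sentence feature.
        n, d = nodes.shape
        q = query.expand(n, n, d)
        pair = torch.cat([nodes[:, None].expand(n, n, d),
                          nodes[None, :].expand(n, n, d), q], dim=-1)
        weights = torch.softmax(self.edge_mlp(pair).squeeze(-1), dim=-1)  # (N, N) edges
        messages = weights @ nodes                                        # aggregated messages
        return self.update(messages, nodes)                               # updated node features

nodes = torch.randn(10, 64)    # 10 temporal segments
query = torch.randn(64)        # pooled sentence feature
mp = LanguageConditionedMessagePassing(64)
print(mp(nodes, query).shape)  # torch.Size([10, 64])
```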
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
- Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
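A joint moment-sentence embedding space is typically trained with a cross-modal ranking objective. The sketch below shows a generic hinge-style ranking loss, not HMAN's hierarchical formulation; all names and the margin value are illustrative:

```python
import torch
import torch.nn.functional as F

def cross_modal_ranking_loss(moment_emb, sent_emb, margin=0.2):
    """Hinge ranking loss: matched moment/sentence pairs (same row index)
    should score higher than all mismatched pairs by a margin."""
    m = F.normalize(moment_emb, dim=-1)
    s = F.normalize(sent_emb, dim=-1)
    sim = m @ s.t()                                  # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                    # (B, 1) matched-pair scores
    cost_s = (margin + sim - pos).clamp(min=0)       # rank sentences for each moment
    cost_m = (margin + sim - pos.t()).clamp(min=0)   # rank moments for each sentence
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)
    return cost_s[off_diag].mean() + cost_m[off_diag].mean()

moments = torch.randn(8, 256)    # 8 candidate moment embeddings
sentences = torch.randn(8, 256)  # their paired sentence embeddings
print(cross_modal_ranking_loss(moments, sentences))
```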
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
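Bilateral, iterative attention between query and video can be sketched as repeated cross-attention in both directions. The following is a generic illustration, not FIAN's architecture; the hyperparameters and names are assumptions:

```python
import torch
import torch.nn as nn

class BilateralIterativeAttention(nn.Module):
    """Alternates video-attends-to-query and query-attends-to-video for a few rounds."""
    def __init__(self, dim, num_heads=4, iterations=2):
        super().__init__()
        self.v2q = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.iterations = iterations

    def forward(self, video, query):
        # video: (B, F, dim) frame features; query: (B, T, dim) word features.
        for _ in range(self.iterations):
            video = video + self.v2q(video, query, query)[0]  # video gathers query info
            query = query + self.q2v(query, video, video)[0]  # query gathers video info
        return video, query

video = torch.randn(2, 64, 128)   # batch of 2 clips, 64 frames each
query = torch.randn(2, 12, 128)   # 12-word queries
fian_like = BilateralIterativeAttention(128)
v, q = fian_like(video, query)
print(v.shape, q.shape)           # torch.Size([2, 64, 128]) torch.Size([2, 12, 128])
```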
arXiv Detail & Related papers (2020-08-06T04:09:03Z)