Text-based Localization of Moments in a Video Corpus
- URL: http://arxiv.org/abs/2008.08716v2
- Date: Wed, 18 Aug 2021 23:08:48 GMT
- Title: Text-based Localization of Moments in a Video Corpus
- Authors: Sudipta Paul, Niluthpol Chowdhury Mithun, and Amit K. Roy-Chowdhury
- Abstract summary: We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
- Score: 38.393877654679414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior works on text-based video moment localization focus on temporally
grounding the textual query in an untrimmed video. These works assume that the
relevant video is already known and attempt to localize the moment on that
relevant video only. Different from such works, we relax this assumption and
address the task of localizing moments in a corpus of videos for a given
sentence query. This task poses a unique challenge as the system is required to
perform: (i) retrieval of the relevant video where only a segment of the video
corresponds with the queried sentence, and (ii) temporal localization of the
moment in the relevant video based on the sentence query. To overcome this
challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns
an effective joint embedding space for moments and sentences. In addition to
learning subtle differences between intra-video moments, HMAN focuses on
distinguishing inter-video global semantic concepts based on sentence queries.
Qualitative and quantitative results on three benchmark text-based video moment
retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions -
demonstrate that our method achieves promising performance on the proposed task
of temporal localization of moments in a corpus of videos.
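The joint-embedding formulation described in the abstract can be illustrated with a small ranking-loss sketch. The PyTorch snippet below is a hypothetical illustration, not the authors' HMAN implementation; the module names, feature dimensions, and margin values are assumptions. It projects moment and sentence features into a shared space and applies hinge losses against both intra-video negatives (other moments of the same video) and inter-video negatives (moments from other videos), mirroring the two properties the abstract attributes to HMAN.

```python
# Illustrative sketch only; not the released HMAN code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project moment and sentence features into a shared embedding space."""
    def __init__(self, moment_dim=500, sent_dim=300, embed_dim=256):  # dims are assumptions
        super().__init__()
        self.moment_proj = nn.Linear(moment_dim, embed_dim)
        self.sent_proj = nn.Linear(sent_dim, embed_dim)

    def forward(self, moment_feat, sent_feat):
        # L2-normalize so a dot product equals cosine similarity.
        m = F.normalize(self.moment_proj(moment_feat), dim=-1)
        s = F.normalize(self.sent_proj(sent_feat), dim=-1)
        return m, s

def triplet_ranking_loss(sent, pos_moment, intra_neg, inter_neg,
                         margin_intra=0.1, margin_inter=0.2):
    """Hinge losses pulling the query toward its moment and away from
    (a) other moments of the same video and (b) moments of other videos.
    Margins are illustrative values, not taken from the paper."""
    pos_sim = (sent * pos_moment).sum(dim=-1)
    intra_sim = (sent * intra_neg).sum(dim=-1)
    inter_sim = (sent * inter_neg).sum(dim=-1)
    loss_intra = F.relu(margin_intra - pos_sim + intra_sim).mean()
    loss_inter = F.relu(margin_inter - pos_sim + inter_sim).mean()
    return loss_intra + loss_inter

if __name__ == "__main__":
    model = JointEmbedding()
    # Dummy batch: 8 sentence queries, each with a positive moment,
    # an intra-video negative, and an inter-video negative.
    sent_feat = torch.randn(8, 300)
    pos_feat, intra_feat, inter_feat = (torch.randn(8, 500) for _ in range(3))
    pos_m, s = model(pos_feat, sent_feat)
    intra_m, _ = model(intra_feat, sent_feat)
    inter_m, _ = model(inter_feat, sent_feat)
    print(triplet_ranking_loss(s, pos_m, intra_m, inter_m))
```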
Related papers
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in a video corpus using the ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
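For context on how the localization task described in the entries above is scored: moment localization results on benchmarks such as Charades-STA and ActivityNet Captions are commonly reported as recall at temporal IoU thresholds (e.g., R@1, IoU=0.5). The snippet below is a generic, illustrative sketch of that metric, not code from any of the listed papers; the interval format and threshold are assumptions.

```python
# Illustrative sketch of the common temporal-IoU recall metric; not from any listed paper.
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top predicted interval overlaps the ground
    truth with temporal IoU >= threshold (i.e., R@1, IoU=threshold)."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

if __name__ == "__main__":
    preds = [(2.0, 7.5), (10.0, 14.0)]   # hypothetical predicted intervals
    gts = [(3.0, 8.0), (11.0, 20.0)]     # hypothetical ground-truth intervals
    print(recall_at_iou(preds, gts, threshold=0.5))  # 0.5
```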