DORi: Discovering Object Relationship for Moment Localization of a
Natural-Language Query in Video
- URL: http://arxiv.org/abs/2010.06260v1
- Date: Tue, 13 Oct 2020 09:50:29 GMT
- Title: DORi: Discovering Object Relationship for Moment Localization of a
Natural-Language Query in Video
- Authors: Cristian Rodriguez-Opazo and Edison Marrese-Taylor and Basura Fernando
and Hongdong Li and Stephen Gould
- Abstract summary: We study the task of temporal moment localization in a long untrimmed video using a natural-language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
- Score: 98.54696229182335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the task of temporal moment localization in a long
untrimmed video using a natural-language query. Given a query sentence, the goal
is to determine the start and end of the relevant segment within the video. Our
key innovation is to learn a video feature embedding through a
language-conditioned message-passing algorithm suitable for temporal moment
localization that captures the relationships between humans, objects, and
activities in the video. These relationships are obtained by a spatial
sub-graph that contextualizes the scene representation using detected objects
and human features conditioned on the language query. Moreover, a temporal
sub-graph captures the activities within the video through time. Our method is
evaluated on three standard benchmark datasets, and we also introduce YouCookII
as a new benchmark for this task. Experiments show our method outperforms
state-of-the-art methods on these datasets, confirming the effectiveness of our
approach.
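To make the key idea concrete, the sketch below illustrates one round of language-conditioned message passing over a spatial sub-graph of detected object/human nodes, in the spirit of the abstract. All module names, dimensions, and the gating scheme are illustrative assumptions, not the authors' implementation; a temporal sub-graph would analogously pass messages between per-segment representations across time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionedGraphLayer(nn.Module):
    """Toy sketch: one message-passing round among object/human nodes,
    with edge weights conditioned on a pooled sentence embedding.
    Dimensions and update rule are assumptions for illustration only."""

    def __init__(self, node_dim=256, query_dim=256):
        super().__init__()
        self.edge_score = nn.Linear(2 * node_dim + query_dim, 1)  # language-conditioned edge scorer
        self.message = nn.Linear(node_dim, node_dim)              # message transform
        self.update = nn.GRUCell(node_dim, node_dim)              # node-state update

    def forward(self, nodes, query):
        # nodes: (N, node_dim) detected object/human features for one frame
        # query: (query_dim,) pooled natural-language query embedding
        n = nodes.size(0)
        src = nodes.unsqueeze(1).expand(n, n, -1)        # sender features for every edge
        dst = nodes.unsqueeze(0).expand(n, n, -1)        # receiver features for every edge
        q = query.view(1, 1, -1).expand(n, n, -1)        # broadcast query to every edge
        logits = self.edge_score(torch.cat([src, dst, q], dim=-1)).squeeze(-1)  # (N, N)
        attn = F.softmax(logits, dim=0)                  # normalize over senders per receiver
        msgs = attn.t() @ self.message(nodes)            # aggregated messages, (N, node_dim)
        return self.update(msgs, nodes)                  # query-contextualized node features

# Minimal usage example with random features.
layer = LanguageConditionedGraphLayer()
objects = torch.randn(6, 256)    # six detected objects/humans in one frame
sentence = torch.randn(256)      # pooled query embedding
contextualized = layer(objects, sentence)   # (6, 256)
```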
Related papers
- A Survey on Video Moment Localization [61.5323647499912]
Video moment localization aims to search a target segment within a video described by a given natural language query.
We present a review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones.
We discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.
arXiv Detail & Related papers (2023-06-13T02:57:32Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z) - ClawCraneNet: Leveraging Object-level Relation for Text-based Video
Segmentation [47.7867284770227]
Text-based video segmentation is a challenging task that segments out the objects referred to by natural language in videos.
We introduce a novel top-down approach that imitates how humans segment an object with language guidance.
Our method outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-19T09:31:08Z) - Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
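Several of the entries above, most directly the last one (Local-Global Video-Text Interactions for Temporal Grounding), regress the target interval from fused video-text features. The sketch below is a hypothetical, minimal regression head in that spirit; the pooling scheme, dimensions, and boundary ordering are assumptions and do not reproduce any specific paper's design.

```python
import torch
import torch.nn as nn

class SpanRegressionHead(nn.Module):
    """Toy regression head: attention-pool fused per-clip video-text features
    and predict a normalized (start, end) pair in [0, 1]. Assumed design."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)          # pooling weights over clips
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 2),                 # (start, end) logits
        )

    def forward(self, fused):
        # fused: (T, feat_dim) features for T clips, assumed already combined
        # with the sentence query by an upstream fusion module.
        weights = torch.softmax(self.attn(fused), dim=0)     # (T, 1)
        pooled = (weights * fused).sum(dim=0)                 # (feat_dim,)
        start, end = torch.sigmoid(self.regressor(pooled))    # boundaries in [0, 1]
        # Order the pair so the predicted start never exceeds the end.
        return torch.stack([torch.minimum(start, end), torch.maximum(start, end)])

# Minimal usage example with random features.
head = SpanRegressionHead()
fused_clips = torch.randn(128, 512)   # e.g. 128 clips of one untrimmed video
print(head(fused_clips))              # normalized (start, end), e.g. [0.31, 0.68]
```

Sorting the two outputs simply guards against a predicted start later than the predicted end; actual models typically train such heads with boundary- or IoU-aware losses rather than relying on this post-hoc ordering.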
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.