Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos
- URL: http://arxiv.org/abs/2008.02448v1
- Date: Thu, 6 Aug 2020 04:09:03 GMT
- Title: Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos
- Authors: Xiaoye Qu, Pengwei Tang, Zhikang Zhou, Yu Cheng, Jianfeng Dong, Pan
Zhou
- Abstract summary: Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
- Score: 63.94898634140878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal language localization in videos aims to ground one video segment in
an untrimmed video based on a given sentence query. To tackle this task, designing an
effective model to extract grounding information from both the visual and textual
modalities is crucial. However, most previous attempts in this field focus only on
unidirectional interactions from video to query, which emphasize which words to listen
to by attending to sentence information via vanilla soft attention, while clues from
query-by-video interactions, which imply where to look, are not taken into
consideration. In this paper, we propose a Fine-grained Iterative Attention Network
(FIAN) that consists of an iterative attention module for bilateral query-video
information extraction. Specifically, in the iterative attention module, each word in
the query is first enhanced by attending to each frame in the video through
fine-grained attention; the video then iteratively attends to the integrated query.
Finally, both video and query information is utilized to provide a robust cross-modal
representation for further moment localization. In addition, to better predict the
target segment, we propose a content-oriented localization strategy instead of applying
recent anchor-based localization. We evaluate the proposed method on three challenging
public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly
outperforms the state-of-the-art approaches.
Related papers
- Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism (see the sketch after this list).
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
- Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
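The biaffine start-end scoring mentioned in the Context-aware Biaffine Localizing Network entry above can be illustrated with a short sketch. The parameterization and tensor shapes here are assumptions for exposition, not that paper's implementation.

```python
# Rough sketch of biaffine scoring over all (start, end) frame-index pairs;
# dimensions and parameterization are illustrative assumptions, not the
# cited paper's implementation.
import torch
import torch.nn as nn


class BiaffineSpanScorer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.start_proj = nn.Linear(dim, dim)                # start-boundary features
        self.end_proj = nn.Linear(dim, dim)                  # end-boundary features
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)  # biaffine weight

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, T, D) frame features -> (B, T, T) scores; entry [i, j] rates span i..j."""
        s = torch.relu(self.start_proj(feats))               # (B, T, D)
        e = torch.relu(self.end_proj(feats))                 # (B, T, D)
        # score[b, i, j] = s[b, i]^T W e[b, j], computed for every pair at once.
        return torch.einsum("bid,de,bje->bij", s, self.W, e)


if __name__ == "__main__":
    scores = BiaffineSpanScorer(dim=256)(torch.randn(2, 64, 256))
    print(scores.shape)  # torch.Size([2, 64, 64])
```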