Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos
- URL: http://arxiv.org/abs/2110.06058v1
- Date: Tue, 12 Oct 2021 14:59:25 GMT
- Title: Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos
- Authors: Zongmeng Zhang, Xianjing Han, Xuemeng Song, Yan Yan and Liqiang Nie
- Abstract summary: This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
- Score: 55.52369116870822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on tackling the problem of temporal language
localization in videos, which aims to identify the start and end points of a
moment described by a natural language sentence in an untrimmed video. The task
is non-trivial, since it requires not only a comprehensive understanding of the
video and the sentence query, but also an accurate capture of the semantic
correspondence between them. Existing efforts are mainly centered on exploring
the sequential relations among video clips and query words to reason about the
video and sentence query, neglecting other intra-modal relations (e.g., semantic
similarity among video clips and syntactic dependency among query words). Toward
this end, we propose a Multi-modal Interaction Graph Convolutional Network
(MIGCN), which jointly explores the complex intra-modal relations and
inter-modal interactions residing in the video and sentence query, facilitating
both their understanding and the capture of the semantic correspondence between
them. In addition, we devise an adaptive context-aware localization method, in
which context information is incorporated into the candidate moments and
multi-scale fully connected layers are designed to rank the generated coarse
candidate moments of different lengths and adjust their boundaries. Extensive
experiments on the Charades-STA and ActivityNet datasets demonstrate the
promising performance and superior efficiency of our model.
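The abstract's core idea is reasoning over a joint graph whose nodes are the video clips and the query words, with intra-modal edges (clip-clip semantic similarity, word-word syntactic dependency) and inter-modal clip-word edges. The PyTorch snippet below is a minimal sketch of one such propagation step, not the authors' implementation: the block-adjacency assembly, the symmetric normalization, and the name MultiModalGraphConv are illustrative assumptions based only on the abstract.

```python
# Minimal sketch (assumptions, not MIGCN's released code): one graph-convolution
# step over a joint video-clip / query-word graph. Inputs are precomputed clip
# features (N_v x d), word features (N_w x d), and the three adjacency blocks
# named in the abstract: clip-clip similarity, word-word syntactic dependency,
# and clip-word inter-modal interaction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalGraphConv(nn.Module):
    """One propagation step over the combined multi-modal interaction graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clip_feats, word_feats, a_vv, a_ww, a_vw):
        # Treat clips and words as a single node set.
        x = torch.cat([clip_feats, word_feats], dim=0)      # (N_v + N_w, d)
        n_v, n = clip_feats.size(0), x.size(0)
        # Block adjacency: intra-modal blocks on the diagonal, inter-modal
        # clip-word block (and its transpose) off the diagonal, plus self-loops.
        adj = torch.zeros(n, n, device=x.device)
        adj[:n_v, :n_v] = a_vv                              # clip-clip similarity
        adj[n_v:, n_v:] = a_ww                              # word-word dependency
        adj[:n_v, n_v:] = a_vw                              # clip-word interaction
        adj[n_v:, :n_v] = a_vw.t()
        adj = adj + torch.eye(n, device=x.device)
        # Symmetric normalization D^{-1/2} A D^{-1/2}, then a linear update.
        d_inv_sqrt = adj.sum(dim=-1).clamp(min=1e-6).pow(-0.5)
        adj_norm = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        out = F.relu(self.proj(adj_norm @ x))
        # Split back into updated clip and word representations.
        return out[:n_v], out[n_v:]
```

In practice the clip-clip block could be a thresholded cosine-similarity matrix over clip features and the word-word block an adjacency derived from a dependency parse of the query, which is the most direct encoding of the intra-modal relations the abstract names.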
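For the adaptive context-aware localization, the abstract states that multi-scale fully connected layers rank the coarse candidate moments of different lengths and adjust their boundaries. The sketch below is one plausible reading under stated assumptions: candidates arrive with a length-scale index, each scale owns a small FC head, and each head emits a ranking score plus start/end offsets; the name MultiScaleMomentHead and the exact head layout are hypothetical.

```python
# Hedged sketch (not the paper's implementation): a multi-scale ranking and
# boundary-refinement head. Candidate moments are grouped by length scale; each
# scale has its own fully connected layers that output a matching score and two
# boundary offsets used to adjust the coarse start/end points.
import torch
import torch.nn as nn

class MultiScaleMomentHead(nn.Module):
    def __init__(self, dim: int, num_scales: int):
        super().__init__()
        # One small FC stack per candidate-length scale.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))
            for _ in range(num_scales)
        ])

    def forward(self, moment_feats, scale_ids):
        """moment_feats: (M, d) context-aware candidate features;
        scale_ids: (M,) length-scale index of each candidate."""
        scores = torch.zeros(moment_feats.size(0), device=moment_feats.device)
        offsets = torch.zeros(moment_feats.size(0), 2, device=moment_feats.device)
        for s, head in enumerate(self.heads):
            mask = scale_ids == s
            if mask.any():
                out = head(moment_feats[mask])              # (m_s, 3)
                scores[mask] = out[:, 0]                    # ranking score
                offsets[mask] = out[:, 1:]                  # start/end adjustment
        return scores, offsets
```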
Related papers
- Disentangle and denoise: Tackling context misalignment for video moment retrieval [16.939535169282262]
Video Moment Retrieval aims to locate in-context video moments according to a natural language query.
This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval.
arXiv Detail & Related papers (2024-08-14T15:00:27Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with the text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Hierarchical Deep Residual Reasoning for Temporal Moment Localization [48.108468456043994]
We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics.
We also design simple yet effective Res-BiGRUs for feature fusion, which are able to grasp useful information in a self-adapting manner.
arXiv Detail & Related papers (2021-10-31T07:13:34Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z)
- Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)