Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with
Natural Language
- URL: http://arxiv.org/abs/2012.02646v1
- Date: Fri, 4 Dec 2020 15:09:35 GMT
- Title: Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with
Natural Language
- Authors: Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, Jiebo Luo
- Abstract summary: We address the problem of retrieving a specific moment from an untrimmed video by natural language.
We model the temporal context between video moments by a set of predefined two-dimensional maps under different temporal scales.
Based on the 2D temporal maps, we propose a Multi-Scale Temporal Adjacent Network (MS-2D-TAN), a single-shot framework for moment localization.
- Score: 112.32586622873731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of retrieving a specific moment from an untrimmed
video by natural language. It is a challenging problem because a target moment
may take place in the context of other temporal moments in the untrimmed video.
Existing methods cannot tackle this challenge well since they do not fully
consider the temporal contexts between temporal moments. In this paper, we
model the temporal context between video moments by a set of predefined
two-dimensional maps under different temporal scales. For each map, one
dimension indicates the starting time of a moment and the other indicates the
duration. These 2D temporal maps can cover diverse video moments with different
lengths, while representing their adjacent contexts at different temporal
scales. Based on the 2D temporal maps, we propose a Multi-Scale Temporal
Adjacent Network (MS-2D-TAN), a single-shot framework for moment localization.
It is capable of encoding the adjacent temporal contexts at each scale, while
learning discriminative features for matching video moments with referring
expressions. We evaluate the proposed MS-2D-TAN on three challenging
benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where our
MS-2D-TAN outperforms the state of the art.
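To make the 2D temporal map idea concrete, below is a minimal sketch (not the authors' released code) of how candidate moments can be enumerated on (start, duration) maps at several temporal scales and scored against a sentence embedding. The pooling choices, scale settings, and cosine scoring are illustrative assumptions; the learned temporal-adjacent convolutions and training objective of MS-2D-TAN are omitted.

```python
import numpy as np

def build_2d_map(clip_feats: np.ndarray, num_segments: int) -> np.ndarray:
    """Pool per-clip features into a (start, duration) map at one temporal scale.

    clip_feats: (T, D) array of per-clip features.
    Returns a (num_segments, num_segments, D) map whose entry [i, j] is the
    mean feature of the moment starting at segment i and spanning j+1 segments
    (cells with i + j >= num_segments stay zero and are masked out later).
    """
    T, D = clip_feats.shape
    # Re-sample the T clips into num_segments coarser segments for this scale.
    bounds = np.linspace(0, T, num_segments + 1, dtype=int)
    segments = np.stack([
        clip_feats[bounds[k]:max(bounds[k + 1], bounds[k] + 1)].mean(axis=0)
        for k in range(num_segments)
    ])
    fmap = np.zeros((num_segments, num_segments, D), dtype=clip_feats.dtype)
    for start in range(num_segments):
        for dur in range(num_segments - start):
            fmap[start, dur] = segments[start:start + dur + 1].mean(axis=0)
    return fmap

def score_all_scales(clip_feats, query_feat, scales=(16, 8, 4)):
    """Cosine-score every valid (start, duration) candidate at each scale."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    score_maps = {}
    for n in scales:
        fmap = build_2d_map(clip_feats, n)
        norms = np.linalg.norm(fmap, axis=-1, keepdims=True) + 1e-8
        scores = (fmap / norms) @ q                     # (n, n) raw scores
        starts = np.arange(n)[:, None]
        durations = np.arange(n)[None, :]
        # Mask cells whose moment would run past the end of the video.
        scores = np.where(starts + durations < n, scores, -np.inf)
        score_maps[n] = scores
    return score_maps

# Toy usage: 64 clips with 256-d features and a 256-d sentence embedding.
rng = np.random.default_rng(0)
clips = rng.standard_normal((64, 256)).astype(np.float32)
query = rng.standard_normal(256).astype(np.float32)
for n, scores in score_all_scales(clips, query).items():
    start, dur = np.unravel_index(np.argmax(scores), scores.shape)
    print(f"scale {n}: best moment = segments [{start}, {start + dur}], "
          f"score = {scores[start, dur]:.3f}")
```

In the full model, each score map would be produced by stacked convolutions over adjacent map cells rather than by raw cosine similarity, which is what lets the network encode the context around each candidate moment.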
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- LITA: Language Instructed Temporal-Localization Assistant [71.68815100776278]
We introduce time tokens that encode timestamps relative to the video length to better represent time in videos (a minimal sketch of this relative time-token encoding appears after this related-papers list).
We also introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution.
We show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs.
arXiv Detail & Related papers (2024-03-27T22:50:48Z)
- Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video Localization [85.85582751254785]
We present a novel approach to NLVL that aims to address this issue.
Our method involves the direct generation of a global 2D temporal map via a conditional denoising diffusion process.
Our approach effectively encapsulates the interaction between the query and video data across various time scales.
arXiv Detail & Related papers (2024-01-16T09:33:29Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Multi-scale 2D Representation Learning for weakly-supervised moment retrieval [18.940164141627914]
We propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval.
Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates.
We select top-K candidates from each scale-varied map with a learnable convolutional neural network.
arXiv Detail & Related papers (2021-11-04T10:48:37Z)
- Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)
- Spatio-Temporal Ranked-Attention Networks for Video Captioning [34.05025890230047]
We propose a model that combines spatial and temporal attention to videos in two different orders.
We provide experiments on two benchmark datasets: MSVD and MSR-VTT.
Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T01:00:45Z)
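As referenced in the LITA entry above, the following is a minimal sketch of the relative time-token idea: timestamps are quantized with respect to the video length into a small discrete vocabulary so that a language model can emit them as ordinary tokens. This is not the LITA implementation; the vocabulary size and the <t_i> token naming are assumptions for illustration only.

```python
NUM_TIME_TOKENS = 100  # assumed granularity; the actual model may differ

def time_to_token(t_sec: float, video_len_sec: float) -> str:
    """Map an absolute timestamp to a time token relative to the video length."""
    frac = min(max(t_sec / video_len_sec, 0.0), 1.0)
    idx = min(int(frac * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)
    return f"<t_{idx}>"

def token_to_time(token: str, video_len_sec: float) -> float:
    """Map a time token back to the centre of its interval, in seconds."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / NUM_TIME_TOKENS * video_len_sec

# Example: in a 90 s video, the moment 27.3 s .. 41.0 s becomes two tokens.
start_tok = time_to_token(27.3, 90.0)   # "<t_30>"
end_tok = time_to_token(41.0, 90.0)     # "<t_45>"
print(start_tok, end_tok)
print(token_to_time(start_tok, 90.0))   # ~27.45 s after decoding
```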
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.