Hierarchical Deep Residual Reasoning for Temporal Moment Localization
- URL: http://arxiv.org/abs/2111.00417v1
- Date: Sun, 31 Oct 2021 07:13:34 GMT
- Title: Hierarchical Deep Residual Reasoning for Temporal Moment Localization
- Authors: Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie
- Abstract summary: We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics.
We also design simple yet effective Res-BiGRUs for feature fusion, which capture useful information in a self-adapting manner.
- Score: 48.108468456043994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Moment Localization (TML) in untrimmed videos is a challenging
multimedia task that aims to localize the start and end points of the activity in a
video described by a sentence query. Existing methods mainly focus on mining the
correlation between video and sentence representations or on how to fuse the two
modalities. These works largely treat the video and sentence coarsely, ignoring the
fact that a sentence can be understood at multiple semantic levels and that the
dominant words for moment localization are the action and the object reference.
Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model,
which decomposes the video and sentence into multi-level representations with
different semantics to achieve finer-grained localization. Furthermore, considering
that videos of different resolutions and sentences of different lengths vary in how
difficult they are to understand, we design simple yet effective Res-BiGRUs for
feature fusion, which capture useful information in a self-adapting manner.
Extensive experiments on the Charades-STA and ActivityNet-Captions datasets
demonstrate the superiority of our HDRR model over other state-of-the-art methods.
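The abstract names two ingredients: decomposing the sentence/video into semantic levels (e.g., action and object) and fusing features with a residual BiGRU ("Res-BiGRU"). The snippet below is a minimal, hypothetical PyTorch sketch of what such a residual BiGRU fusion block could look like; it is not the authors' released code, and the class name, hidden size, and tensor shapes are assumptions for illustration only.

```python
# Hedged sketch (not the HDRR implementation): a residual bidirectional GRU block.
# Assumption: fused video-sentence features arrive as a (batch, seq_len, dim) tensor
# for one semantic level (e.g., the action-level or object-level stream).
import torch
import torch.nn as nn


class ResBiGRU(nn.Module):
    """BiGRU whose output is projected back to the input size and added to the
    input, so the block can learn how much recurrent refinement to apply and
    simply pass features through when little refinement is needed."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.bigru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)  # concat of both directions -> dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.bigru(x)          # (batch, seq_len, 2 * hidden)
        return x + self.proj(out)       # residual connection


# Usage example with illustrative sizes: 2 videos, 32 clips, 512-d fused features.
feats = torch.randn(2, 32, 512)
refined = ResBiGRU(dim=512)(feats)      # same shape as the input
```

The residual connection is what makes the block "self-adapting" in the sense the abstract describes: inputs that are already easy to understand can bypass the recurrent update, while harder inputs can receive a larger learned correction.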
Related papers
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with a multi-modal contrastive loss.
Our approach is designed to capture the dependencies between modalities, resulting in more accurate and pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)