Hierarchical Deep Residual Reasoning for Temporal Moment Localization
- URL: http://arxiv.org/abs/2111.00417v1
- Date: Sun, 31 Oct 2021 07:13:34 GMT
- Title: Hierarchical Deep Residual Reasoning for Temporal Moment Localization
- Authors: Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie
- Abstract summary: We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics.
We also design simple yet effective Res-BiGRUs for feature fusion, which capture useful information in a self-adapting manner.
- Score: 48.108468456043994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Moment Localization (TML) in untrimmed videos is a challenging
multimedia task that aims to localize the start and end points of the activity in a
video described by a sentence query. Existing methods mainly focus on mining the
correlation between video and sentence representations or on how to fuse the two
modalities. These works largely treat the video and sentence coarsely, ignoring the
fact that a sentence can be understood at multiple semantic levels and that the
dominant words for moment localization are the action and the object reference.
Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model,
which decomposes the video and sentence into multi-level representations with
different semantics to achieve finer-grained localization. Furthermore, considering
that videos of different resolutions and sentences of different lengths vary in how
difficult they are to understand, we design simple yet effective Res-BiGRUs for
feature fusion, which capture useful information in a self-adapting manner.
Extensive experiments on the Charades-STA and ActivityNet-Captions datasets
demonstrate the superiority of our HDRR model over other state-of-the-art methods.
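The abstract names two ingredients: decomposing the sentence/video into semantic levels (e.g., action and object) and fusing features with a residual BiGRU ("Res-BiGRU"). The snippet below is a minimal, hypothetical PyTorch sketch of what such a residual BiGRU fusion block could look like; it is not the authors' released code, and the class name, hidden size, and tensor shapes are assumptions for illustration only.

```python
# Hedged sketch (not the HDRR implementation): a residual bidirectional GRU block.
# Assumption: fused video-sentence features arrive as a (batch, seq_len, dim) tensor
# for one semantic level (e.g., the action-level or object-level stream).
import torch
import torch.nn as nn


class ResBiGRU(nn.Module):
    """BiGRU whose output is projected back to the input size and added to the
    input, so the block can learn how much recurrent refinement to apply and
    simply pass features through when little refinement is needed."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.bigru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)  # concat of both directions -> dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.bigru(x)          # (batch, seq_len, 2 * hidden)
        return x + self.proj(out)       # residual connection


# Usage example with illustrative sizes: 2 videos, 32 clips, 512-d fused features.
feats = torch.randn(2, 32, 512)
refined = ResBiGRU(dim=512)(feats)      # same shape as the input
```

The residual connection is what makes the block "self-adapting" in the sense the abstract describes: inputs that are already easy to understand can bypass the recurrent update, while harder inputs can receive a larger learned correction.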
Related papers
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with a multi-modal contrastive loss.
Our approach is designed to capture the dependencies between modalities, resulting in more accurate and pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)