A Simple Yet Effective Method for Video Temporal Grounding with
Cross-Modality Attention
- URL: http://arxiv.org/abs/2009.11232v1
- Date: Wed, 23 Sep 2020 16:03:00 GMT
- Title: A Simple Yet Effective Method for Video Temporal Grounding with
Cross-Modality Attention
- Authors: Binjie Zhang, Yu Li, Chun Yuan, Dejing Xu, Pin Jiang, Ying Shan
- Abstract summary: The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structural design.
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
- Score: 31.218804432716702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of language-guided video temporal grounding is to localize the
particular video clip corresponding to a query sentence in an untrimmed video.
Though progress has been made continuously in this field, some issues still
need to be resolved. First, most of the existing methods rely on the
combination of multiple complicated modules to solve the task. Second, due to
the semantic gap between the two modalities, aligning information at different
granularities (local and global) between the video and the language is
essential, yet it remains less addressed. Last, previous works do
not consider the inevitable annotation bias due to the ambiguities of action
boundaries. To address these limitations, we propose a simple two-branch
Cross-Modality Attention (CMA) module with an intuitive structural design, which
alternately modulates the two modalities to better match the information both
locally and globally. In addition, we introduce a new task-specific regression
loss function, which improves the temporal grounding accuracy by alleviating
the impact of annotation bias. We conduct extensive experiments to validate our
method, and the results show that this simple model alone outperforms the state
of the art on both the Charades-STA and ActivityNet Captions datasets.
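The abstract does not spell out the internals of the two-branch CMA module or the task-specific regression loss, so the following is only a minimal sketch of the general idea under stated assumptions: two cross-attention branches in which each modality is modulated by the other, a small head regressing normalized (start, end) boundaries, and a Huber-style loss that down-weights small boundary errors as a stand-in for a loss tolerant to annotation bias. All module names, dimensions, and the loss form are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch only: the abstract does not give the exact CMA architecture or
# the task-specific regression loss, so every name, dimension, and the
# Huber-style loss below is an assumption for illustration.
import torch
import torch.nn as nn


class CrossModalityAttentionBranch(nn.Module):
    """One branch: modulate `target` features by attending over the `source` modality."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, T, D) queries; source: (B, S, D) keys and values
        attended, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + attended)  # residual connection keeps local detail


class TwoBranchCMA(nn.Module):
    """Two-branch cross-modality attention followed by boundary regression."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.video_branch = CrossModalityAttentionBranch(dim)  # video attends to words
        self.lang_branch = CrossModalityAttentionBranch(dim)   # words attend to video
        # Head regressing normalized (start, end) of the target moment.
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, video_feats: torch.Tensor, lang_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T, D) clip features; lang_feats: (B, L, D) word features
        v = self.video_branch(video_feats, lang_feats)  # locally modulated video
        q = self.lang_branch(lang_feats, video_feats)   # locally modulated language
        # Mean pooling gives a global summary of each modulated sequence.
        fused = torch.cat([v.mean(dim=1), q.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.head(fused))          # (B, 2) normalized boundaries


def boundary_regression_loss(pred: torch.Tensor, gt: torch.Tensor, delta: float = 0.05) -> torch.Tensor:
    """Huber-style loss: errors smaller than `delta` (roughly the scale of boundary
    annotation noise) are penalized quadratically, so ambiguous annotations near the
    true boundary dominate the gradient less than large localization errors."""
    err = (pred - gt).abs()
    return torch.where(err < delta, 0.5 * err ** 2 / delta, err - 0.5 * delta).mean()
```

For example, `TwoBranchCMA(dim=512)` applied to a `(4, 64, 512)` clip tensor and a `(4, 12, 512)` word tensor returns a `(4, 2)` tensor of normalized boundaries that can be fed to `boundary_regression_loss`; the authors' actual grounding head and loss may differ, as the sketch only illustrates the two-directional modulation and the bias-tolerant loss shape.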
Related papers
- Multi-scale 2D Temporal Map Diffusion Models for Natural Language Video
Localization [85.85582751254785]
We present a novel approach to natural language video localization (NLVL) that aims to address this issue.
Our method involves the direct generation of a global 2D temporal map via a conditional denoising diffusion process.
Our approach effectively encapsulates the interaction between the query and video data across various time scales.
arXiv Detail & Related papers (2024-01-16T09:33:29Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Hierarchical Deep Residual Reasoning for Temporal Moment Localization [48.108468456043994]
We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics.
We also design simple yet effective Res-BiGRUs for feature fusion, which capture useful information in a self-adaptive manner.
arXiv Detail & Related papers (2021-10-31T07:13:34Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity
Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) trimmed ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
- Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)