Hierarchical Local-Global Transformer for Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2208.14882v1
- Date: Wed, 31 Aug 2022 14:16:56 GMT
- Title: Hierarchical Local-Global Transformer for Temporal Sentence Grounding
- Authors: Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu and Ruixuan Li
- Abstract summary: This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
- Score: 58.247592985849124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the multimedia problem of temporal sentence grounding
(TSG), which aims to accurately determine the specific video segment in an
untrimmed video according to a given sentence query. Traditional TSG methods
mainly follow a top-down or bottom-up framework and are not end-to-end: they
rely heavily on time-consuming post-processing to refine the grounding
results. Recently, several transformer-based approaches have been proposed to
efficiently and effectively model the fine-grained semantic alignment between
video and query. Although these methods achieve promising performance, they
treat video frames and query words equally as transformer inputs for
correlation, failing to capture their different levels of granularity and
distinct semantics. To address this issue, in this paper, we
propose a novel Hierarchical Local-Global Transformer (HLGT) to leverage this
hierarchy information and model the interactions between different levels of
granularity and different modalities for learning more fine-grained multi-modal
representations. Specifically, we first split the video and query into
individual clips and phrases to learn their local context (adjacent dependency)
and global correlation (long-range dependency) via a temporal transformer.
Then, a global-local transformer is introduced to learn the interactions
between the local-level and global-level semantics for better multi-modal
reasoning. In addition, we develop a new cross-modal cycle-consistency loss to
enforce interaction between the two modalities and encourage semantic alignment
between them. Finally, we design a new cross-modal parallel transformer
decoder to integrate the encoded visual and textual features for final
grounding. Extensive experiments on three challenging datasets show that our
proposed HLGT achieves a new state-of-the-art performance.
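A minimal PyTorch sketch of two ideas named in the abstract follows: windowed versus full self-attention for local context and global correlation over clip/phrase features, and a round-trip cross-modal cycle-consistency objective. All shapes, module names and hyper-parameters here are assumptions for illustration; this is not the authors' released HLGT implementation.

```python
# Hedged sketch of local/global temporal attention and a cycle-consistency loss.
# Shapes, names and hyper-parameters are illustrative assumptions, not HLGT code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalBlock(nn.Module):
    """Model local context (adjacent dependency) with windowed self-attention over
    clip/phrase features and global correlation (long-range dependency) with full
    self-attention, then fuse the two views."""

    def __init__(self, dim=256, heads=4, window=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                        # x: (B, N, D), N divisible by window
        B, N, D = x.shape
        w = x.reshape(B * (N // self.window), self.window, D)
        local, _ = self.local_attn(w, w, w)      # attention only inside each window
        local = local.reshape(B, N, D)
        glob, _ = self.global_attn(x, x, x)      # attention across the whole sequence
        return self.fuse(torch.cat([local, glob], dim=-1))


def cycle_consistency_loss(video_feats, query_feats, temperature=0.1):
    """Cross-modal cycle consistency: each clip attends to the phrases and back;
    the round trip should land on the clip it started from."""
    v = F.normalize(video_feats, dim=-1)         # (B, Nv, D) clip features
    q = F.normalize(query_feats, dim=-1)         # (B, Nq, D) phrase features
    attn_vq = (torch.bmm(v, q.transpose(1, 2)) / temperature).softmax(dim=-1)
    q_hat = F.normalize(torch.bmm(attn_vq, q), dim=-1)          # phrase mixture per clip
    logits = torch.bmm(q_hat, v.transpose(1, 2)) / temperature  # (B, Nv, Nv) back to video
    target = torch.arange(v.size(1)).expand(v.size(0), -1)
    return F.cross_entropy(logits.flatten(0, 1), target.flatten())


if __name__ == "__main__":
    clips, phrases = torch.randn(2, 16, 256), torch.randn(2, 6, 256)
    fused = LocalGlobalBlock()(clips)            # (2, 16, 256)
    print(fused.shape, cycle_consistency_loss(fused, phrases).item())
```

The windowed branch restricts attention to adjacent clips while the full-attention branch covers long-range dependencies; the cycle loss asks each clip, after attending to the phrases and back, to return to itself.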
Related papers
- Referred by Multi-Modality: A Unified Temporal Transformer for Referring Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer (MUTR) for referring video object segmentation.
Within a unified framework, for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Generation-Guided Multi-Level Unified Network for Video Grounding [18.402093379973085]
Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.
Moment-level approaches directly predict, from a global perspective, the probability of each transient moment being a boundary.
Clip-level approaches aggregate the moments in different time windows into proposals and then select the most similar one, giving them an advantage in fine-grained grounding.
arXiv Detail & Related papers (2023-03-14T09:48:59Z)
- Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer [13.71165050314854]
We present a new method for end-to-end Video Question Answering (VideoQA).
We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer.
We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks.
arXiv Detail & Related papers (2023-02-04T09:14:18Z)
- RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simplicity in design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval [40.646628490887075]
We propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval.
HiT performs hierarchical cross-modal contrastive matching at the feature and semantic levels to achieve multi-view and comprehensive retrieval results.
Inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative interactions on-the-fly.
arXiv Detail & Related papers (2021-03-28T04:52:25Z)
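Since the HiT entry above attributes its scalability to MoCo-style momentum contrast, a short sketch of that general mechanism (a momentum-updated key encoder plus a fixed-size queue of negatives) is given below. It is a hedged illustration under assumed names and sizes, not HiT's actual code, and the video-query/text-key pairing shown is only one plausible instantiation.

```python
# Hedged sketch of MoCo-style momentum contrast for cross-modal retrieval; names,
# sizes and the video->text pairing are assumptions, not HiT's released code.
import torch
import torch.nn.functional as F


class NegativeQueue:
    """Fixed-size FIFO of key embeddings so each step sees many negatives."""

    def __init__(self, dim=256, size=4096):
        self.keys = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def push(self, new_keys):                    # new_keys: (B, D) from the key encoder
        b = new_keys.size(0)
        idx = (torch.arange(b) + self.ptr) % self.keys.size(0)
        self.keys[idx] = F.normalize(new_keys, dim=-1)
        self.ptr = (self.ptr + b) % self.keys.size(0)


@torch.no_grad()
def momentum_update(key_enc, query_enc, m=0.999):
    """Key encoder trails the query encoder: theta_k <- m*theta_k + (1-m)*theta_q."""
    for pk, pq in zip(key_enc.parameters(), query_enc.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)


def momentum_contrast_loss(video_queries, text_keys, queue, temperature=0.07):
    """InfoNCE with one positive (matching text) and the queued negatives."""
    v = F.normalize(video_queries, dim=-1)       # (B, D) from the query encoder
    k = F.normalize(text_keys, dim=-1)           # (B, D) from the momentum key encoder
    pos = (v * k).sum(dim=-1, keepdim=True)      # (B, 1) positive similarities
    neg = v @ queue.keys.t()                     # (B, size) similarities to negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    targets = torch.zeros(v.size(0), dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    queue = NegativeQueue()
    v, t = torch.randn(8, 256), torch.randn(8, 256)
    loss = momentum_contrast_loss(v, t, queue)
    queue.push(t)                                # enqueue this batch's keys for later steps
    print(loss.item())
```

The queue lets each training step contrast against thousands of negatives without enlarging the batch, while the momentum update keeps the queued keys consistent over time.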