MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding
- URL: http://arxiv.org/abs/2212.13163v2
- Date: Tue, 27 Dec 2022 05:14:51 GMT
- Title: MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding
- Authors: Wei Ji, Long Chen, Yinwei Wei, Yiming Wu, Tat-Seng Chua
- Abstract summary: We propose a novel multi-resolution temporal video sentence grounding network: MRTNet.
MRTNet consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module.
Our MRT module is hot-pluggable, meaning it can be seamlessly incorporated into any anchor-free model.
- Score: 70.82093938170051
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given an untrimmed video and a natural language query, video sentence grounding
aims to localize the target temporal moment in the video. Existing methods
mainly tackle this task by matching and aligning semantics of the descriptive
sentence and video segments on a single temporal resolution, while neglecting
the temporal consistency of video content in different resolutions. In this
work, we propose a novel multi-resolution temporal video sentence grounding
network: MRTNet, which consists of a multi-modal feature encoder, a
Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT module is
an encoder-decoder network whose decoder output features are used in conjunction
with Transformers to predict the final start and end timestamps.
Notably, our MRT module is hot-pluggable, meaning it can be seamlessly
incorporated into any anchor-free model. In addition, we use a hybrid loss to
supervise the cross-modal features in the MRT module at three scales
(frame-level, clip-level, and sequence-level) for more accurate grounding.
Extensive experiments on three prevalent datasets demonstrate the effectiveness
of MRTNet.
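The abstract describes the MRT module and its three-scale supervision only in prose. As a rough sketch under stated assumptions (not the authors' released code), the PyTorch snippet below shows one plausible shape for such a multi-resolution temporal encoder-decoder and hybrid loss; all layer choices, dimensions, and loss weights are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MRTModule(nn.Module):
    """Hypothetical sketch of a Multi-Resolution Temporal (MRT) module.

    The encoder pools fused video-query features to coarser temporal
    resolutions; the decoder upsamples back, adding skip connections so
    each level mixes fine and coarse temporal context. Layer choices and
    sizes are assumptions, not the authors' implementation.
    """

    def __init__(self, dim: int = 256, levels: int = 3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(levels)
        )
        self.up = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1)
            for _ in range(levels)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time) fused cross-modal features.
        skips = []
        for down in self.down:
            skips.append(x)
            x = F.relu(down(x))  # halve the temporal resolution
        for up, skip in zip(self.up, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-1], mode="linear",
                              align_corners=False)
            x = F.relu(up(x)) + skip  # fuse coarse and fine context
        return x  # (batch, dim, time), temporally refined


def hybrid_loss(frame_logits, clip_logits, span_pred,
                frame_gt, clip_gt, span_gt):
    """Illustrative three-scale supervision (frame, clip, sequence).

    The abstract names the three scales; the concrete loss terms and
    their equal weighting here are assumptions.
    """
    frame_term = F.binary_cross_entropy_with_logits(frame_logits, frame_gt)
    clip_term = F.binary_cross_entropy_with_logits(clip_logits, clip_gt)
    span_term = F.l1_loss(span_pred, span_gt)  # start/end timestamps
    return frame_term + clip_term + span_term


# Example (hypothetical shapes): 2 clips, 256-d features, 64 frames.
# features = torch.randn(2, 256, 64)
# refined = MRTModule()(features)  # -> (2, 256, 64)
```

In this reading, each decoder level restores temporal resolution while mixing in a skip connection, and the refined sequence would then feed the Transformer-based predictor that outputs the start and end timestamps.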
Related papers
- Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection [5.068440399797739]
Current RGB-Thermal Video Object Detection (RGBT VOD) methods depend on manually aligning data at the image level.
We propose a Multi-modal Dynamic Local fusion Network (MDLNet) designed to handle unaligned RGBT pairs.
We conduct a comprehensive evaluation comparing MDLNet with state-of-the-art (SOTA) models, demonstrating its superior effectiveness.
arXiv Detail & Related papers (2024-10-16T01:06:12Z) - Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called the Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z) - MH-DETR: Video Moment and Highlight Detection with Cross-modal
Transformer [17.29632719667594]
We propose MH-DETR (Moment and Highlight Detection Transformer), tailored for video moment and highlight detection (MHD).
We introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context.
To obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder.
arXiv Detail & Related papers (2023-04-29T22:50:53Z) - Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual
Grounding [27.568879624013576]
The multimodal transformer exhibits high capacity and flexibility in aligning image and text for visual grounding.
Existing encoder-only grounding frameworks suffer from heavy computation due to the quadratic time complexity of self-attention.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z) - VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z) - Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed on top of the hierarchical transformer.
Practically, extensive experiments show that the hierarchical multimodal transformer (HMT) surpasses most traditional, RNN-based, and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z) - Temporal Modulation Network for Controllable Space-Time Video
Super-Resolution [66.06549492893947]
Space-time video super-resolution (STVSR) aims to increase the spatial and temporal resolutions of low-resolution, low-frame-rate videos.
Deformable-convolution-based methods have achieved promising STVSR performance, but they can only infer intermediate frames pre-defined in the training stage.
We propose a Temporal Modulation Network (TMNet) to interpolate arbitrary intermediate frame(s) with accurate high-resolution reconstruction.
arXiv Detail & Related papers (2021-04-21T17:10:53Z) - UNETR: Transformers for 3D Medical Image Segmentation [8.59571749685388]
We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a pure transformer as the encoder to learn sequence representations of the input volume.
We have extensively validated the performance of our proposed model across different imaging modalities.
arXiv Detail & Related papers (2021-03-18T20:17:15Z) - Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video
Super-Resolution [95.26202278535543]
A simple solution is to split it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR).
However, temporal interpolation and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework that directly synthesizes an HR slow-motion video from an LFR, LR video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)