MS-DETR: Natural Language Video Localization with Sampling Moment-Moment
Interaction
- URL: http://arxiv.org/abs/2305.18969v1
- Date: Tue, 30 May 2023 12:06:35 GMT
- Title: MS-DETR: Natural Language Video Localization with Sampling Moment-Moment
Interaction
- Authors: Jing Wang, Aixin Sun, Hao Zhang, and Xiaoli Li
- Abstract summary: Given a query, the task of Natural Language Video Localization (NLVL) is to localize a temporal moment in an untrimmed video that semantically matches the query.
In this paper, we adopt a proposal-based solution that generates proposals (i.e., candidate moments) and then selects the best matching proposal.
On top of modeling the cross-modal interaction between candidate moments and the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient moment-moment relation modeling.
- Score: 28.21563211881665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a query, the task of Natural Language Video Localization (NLVL) is to
localize a temporal moment in an untrimmed video that semantically matches the
query. In this paper, we adopt a proposal-based solution that generates
proposals (i.e., candidate moments) and then selects the best matching proposal.
On top of modeling the cross-modal interaction between candidate moments and
the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient
moment-moment relation modeling. The core idea is to sample a subset of moments
guided by the learnable templates with an adapted DETR (DEtection TRansformer)
framework. To achieve this, we design a multi-scale visual-linguistic encoder,
and an anchor-guided moment decoder paired with a set of learnable templates.
Experimental results on three public datasets demonstrate the superior
performance of MS-DETR.
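The abstract sketches the overall pipeline: a multi-scale visual-linguistic encoder fuses video and query features, and an anchor-guided moment decoder with learnable templates predicts candidate moments, of which only a sampled subset takes part in moment-moment relation modeling. The following is a minimal, hypothetical PyTorch sketch of such a pipeline; the module layout, dimensions, and top-k sampling rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MSDETRSketch(nn.Module):
    """Hypothetical sketch of an MS-DETR-style pipeline (not the official code)."""

    def __init__(self, d_model=256, num_templates=30, num_sampled=10):
        super().__init__()
        # Stand-in for the multi-scale visual-linguistic encoder: fuse video and query tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Learnable templates that guide which candidate moments are proposed.
        self.templates = nn.Embedding(num_templates, d_model)
        # Stand-in for the anchor-guided moment decoder (DETR-style cross-attention).
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # Predict a normalized (center, width) span and a matching score per template.
        self.span_head = nn.Linear(d_model, 2)
        self.score_head = nn.Linear(d_model, 1)
        self.num_sampled = num_sampled

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, d), query_feats: (B, L, d)
        memory = self.encoder(torch.cat([video_feats, query_feats], dim=1))
        queries = self.templates.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(queries, memory)          # (B, num_templates, d)
        scores = self.score_head(hs).squeeze(-1)    # (B, num_templates)
        spans = self.span_head(hs).sigmoid()        # (B, num_templates, 2)
        # Sample only the top-scoring subset of moments for further relation modeling.
        top = scores.topk(self.num_sampled, dim=1).indices
        sampled_spans = torch.gather(spans, 1, top.unsqueeze(-1).expand(-1, -1, 2))
        return sampled_spans, scores

# Usage: one video of 64 clip features and one query of 12 token features, both 256-d.
model = MSDETRSketch()
spans, scores = model(torch.randn(1, 64, 256), torch.randn(1, 12, 256))
print(spans.shape)  # torch.Size([1, 10, 2])
```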
Related papers
- Localizing Events in Videos with Multimodal Queries [71.40602125623668]
We introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries.
We include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains.
arXiv Detail & Related papers (2024-06-14T14:35:58Z)
- Context-Enhanced Video Moment Retrieval with Large Language Models [22.283367604425916]
Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives.
We propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation.
Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks.
arXiv Detail & Related papers (2024-05-21T07:12:27Z)
- TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection [9.032057312774564]
Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks.
Several methods build DETR-based networks to solve both MR and HD jointly.
We propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD.
arXiv Detail & Related papers (2024-01-04T14:55:57Z)
- Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval [20.493241098064665]
Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query.
We propose a self-supervised learning framework: Modal-specific Pseudo Query Generation Network (MPGN).
MPGN generates pseudo queries exploiting both visual and textual information from selected temporal moments.
We show that MPGN successfully learns to localize the video corpus moment without any explicit annotation.
arXiv Detail & Related papers (2022-10-23T05:05:18Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
- With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z)
- Relation-aware Video Reading Comprehension for Temporal Language Grounding [67.5613853693704]
Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence.
This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it.
arXiv Detail & Related papers (2021-10-12T03:10:21Z)
- Progressive Localization Networks for Language-based Moment Localization [56.54450664871467]
This paper focuses on the task of language-based moment localization.
Most existing methods prefer to first sample sufficient candidate moments with various temporal lengths, and then match them with the given query to determine the target moment.
We propose a novel multi-stage Progressive localization Network (PLN) which progressively localizes the target moment in a coarse-to-fine manner.
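As a concrete illustration of the candidate-sampling scheme described above (enumerate moments of various temporal lengths, then match them against the query), here is a small hypothetical PyTorch sketch; the window sizes, stride, and cosine-similarity matcher are arbitrary stand-ins rather than any specific paper's method.

```python
import torch

def enumerate_candidate_moments(num_clips, window_sizes=(8, 16, 32), stride_ratio=0.5):
    """Enumerate multi-scale candidate moments as (start, end) clip indices."""
    moments = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        for start in range(0, max(1, num_clips - w + 1), stride):
            moments.append((start, min(start + w, num_clips)))
    return moments

def score_moments(clip_feats, query_feat, moments):
    """Score each candidate by cosine similarity between its mean-pooled clip
    features and a pooled query feature (a simple stand-in matcher)."""
    scores = [torch.cosine_similarity(clip_feats[s:e].mean(dim=0), query_feat, dim=0)
              for s, e in moments]
    return torch.stack(scores)

# Usage: 64 clips with 256-d features and one pooled 256-d query vector.
clips, query = torch.randn(64, 256), torch.randn(256)
candidates = enumerate_candidate_moments(64)
best = candidates[int(score_moments(clips, query, candidates).argmax())]
print(best)  # the highest-scoring (start, end) candidate moment
```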
arXiv Detail & Related papers (2021-02-02T03:45:59Z)
- VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval [21.189093631175425]
Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video specified by a natural language query.
This paper explores methods for performing VMR in a weakly-supervised manner (wVMR).
The experiments show that the method achieves state-of-the-art performance on Charades-STA and DiDeMo datasets.
arXiv Detail & Related papers (2020-08-24T07:54:59Z)