Multi-scale 2D Representation Learning for weakly-supervised moment retrieval
- URL: http://arxiv.org/abs/2111.02741v1
- Date: Thu, 4 Nov 2021 10:48:37 GMT
- Title: Multi-scale 2D Representation Learning for weakly-supervised moment retrieval
- Authors: Ding Li, Rui Wu, Yongqiang Tang, Zhizhong Zhang and Wensheng Zhang
- Abstract summary: We propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval.
Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates.
We select top-K candidates from each scale-varied map with a learnable convolutional neural network.
- Score: 18.940164141627914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval aims to search the moment most relevant to a given
language query. However, most existing methods in this community often require
temporal boundary annotations which are expensive and time-consuming to label.
Hence, weakly supervised methods, which use only coarse video-level labels, have
been put forward recently. Despite their effectiveness, these methods usually
process moment candidates independently, ignoring the natural temporal
dependencies between candidates at different temporal scales.
To cope with this issue, we propose a Multi-scale 2D Representation Learning
method for weakly supervised video moment retrieval. Specifically, we first
construct a two-dimensional map for each temporal scale to capture the temporal
dependencies between candidates. Two dimensions in this map indicate the start
and end time points of these candidates. Then, we select top-K candidates from
each scale-varied map with a learnable convolutional neural network. With a
newly designed Moments Evaluation Module, we obtain the alignment scores of the
selected candidates. Finally, the similarity between captions and the language
query serves as supervision for further training the candidate selector.
Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions,
demonstrate that our approach outperforms state-of-the-art methods.
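To make the pipeline above concrete, here is a minimal sketch of the two core steps: building a 2D candidate map for one temporal scale and selecting top-K candidates with a learnable CNN. All module names, tensor shapes, and strides are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ScaleMap2D(nn.Module):
    """Builds and scores a 2D candidate map for one temporal scale.

    Hypothetical sketch: clip features (B, T, D) are mean-pooled into a
    candidate feature map2d[b, :, s, e] for each valid (start s, end e)
    pair; a small learnable CNN then predicts one score per candidate.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, 1, kernel_size=1),
        )

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        B, T, D = clips.shape
        map2d = clips.new_zeros(B, D, T, T)
        for s in range(T):
            for e in range(s, T):
                map2d[:, :, s, e] = clips[:, s : e + 1].mean(dim=1)
        return self.scorer(map2d).squeeze(1)  # (B, T, T) candidate scores

def topk_candidates(scores: torch.Tensor, k: int):
    """Picks the K highest-scoring (start, end) pairs from one scale's map."""
    B, T, _ = scores.shape
    # Mask the lower triangle: an end index must not precede its start.
    valid = torch.triu(torch.ones(T, T, dtype=torch.bool, device=scores.device))
    flat = scores.masked_fill(~valid, float("-inf")).flatten(1)
    vals, idx = flat.topk(k, dim=1)
    starts = torch.div(idx, T, rounding_mode="floor")
    return vals, torch.stack((starts, idx % T), dim=-1)  # scores, (s, e)

# Multi-scale usage: one map per temporal granularity (strides assumed).
model = ScaleMap2D(256)
feats = torch.randn(2, 32, 256)  # (batch, clips, feature dim)
for stride in (1, 2, 4):  # coarser scales cover longer moments per cell
    scores = model(feats[:, ::stride])
    top_scores, top_spans = topk_candidates(scores, k=5)
```

Because temporally adjacent candidates occupy adjacent cells in each map, the convolutional scorer lets neighbouring candidates inform each other, which is exactly the dependency that independent per-candidate scoring misses.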
Related papers
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z) - Candidate Set Re-ranking for Composed Image Retrieval with Dual
Multi-modal Encoder [45.60134971181856]
Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair.
Existing methods pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time.
We propose to combine the merits of both schemes using a two-stage model.
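As a rough illustration of this two-stage idea (the function names here are hypothetical, not the paper's API), a cheap first stage ranks the whole corpus with precomputed embeddings and an expensive second stage re-scores only the shortlisted candidates:

```python
import torch

def two_stage_retrieval(query_emb, corpus_embs, rerank_fn, k=100):
    """Stage 1: cosine similarity against precomputed corpus embeddings.
    Stage 2: re-score only the top-k shortlist with a costlier joint model.

    query_emb: (D,) embedding of the (reference image, text) query
    corpus_embs: (N, D) precomputed candidate image embeddings
    rerank_fn: callable mapping candidate indices to fine-grained scores
    """
    sims = torch.nn.functional.cosine_similarity(
        corpus_embs, query_emb.unsqueeze(0), dim=1
    )
    top_idx = sims.topk(k).indices  # cheap coarse filter over all N images
    fine = rerank_fn(top_idx)       # expensive joint-encoder pass on k items
    order = fine.argsort(descending=True)
    return top_idx[order]           # final ranking of the shortlist
```

The trade-off is the usual one: the slow model's cost now scales with the shortlist size k rather than with the corpus size N.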
arXiv Detail & Related papers (2023-05-25T17:56:24Z) - Locate before Answering: Answer Guided Question Localization for Video
Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
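A minimal sketch of this cut-and-paste pretext task, under assumed shapes and helper names (none of this is the released UP-TAL code):

```python
import torch
import torch.nn.functional as F

def paste_region(video, region, pos):
    """Overwrites clips of `video` (T, D) starting at `pos` with `region`."""
    out = video.clone()
    out[pos : pos + region.shape[0]] = region
    return out

def pal_loss(encoder, src, bg1, bg2, length=8):
    """Pseudo Action Localization pretext objective (illustrative only).

    src, bg1, bg2: clip features of shape (T, D) from three videos.
    A random region of `src` is pasted at random positions in bg1 and
    bg2; the loss pulls the two pasted-region encodings together.
    """
    start = torch.randint(0, src.shape[0] - length + 1, (1,)).item()
    action = src[start : start + length]  # the pseudo action
    p1 = torch.randint(0, bg1.shape[0] - length + 1, (1,)).item()
    p2 = torch.randint(0, bg2.shape[0] - length + 1, (1,)).item()
    v1, v2 = paste_region(bg1, action, p1), paste_region(bg2, action, p2)
    # Encode both synthetic videos, then compare the pasted spans.
    z1 = encoder(v1)[p1 : p1 + length].mean(dim=0)
    z2 = encoder(v2)[p2 : p2 + length].mean(dim=0)
    return 1 - F.cosine_similarity(z1, z2, dim=0)  # maximize agreement
```

Here `encoder` is any clip-feature encoder mapping (T, D) to (T, D'); aligning the two pasted regions forces it to represent the same action consistently across different temporal positions and backgrounds.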
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with
Natural Language [112.32586622873731]
We address the problem of retrieving a specific moment from an untrimmed video by natural language.
We model the temporal context between video moments by a set of predefined two-dimensional maps under different temporal scales.
Based on the 2D temporal maps, we propose a Multi-Scale Temporal Adjacent Network (MS-2D-TAN), a single-shot framework for moment localization.
arXiv Detail & Related papers (2020-12-04T15:09:35Z) - DORi: Discovering Object Relationship for Moment Localization of a
Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
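A hypothetical single round of such query-conditioned message passing, stripped to its core (the actual DORi graph has typed object, activity, and temporal nodes; everything below is an assumption for illustration):

```python
import torch
import torch.nn as nn

class LanguageConditionedMP(nn.Module):
    """One query-conditioned message-passing update over node features.

    Illustrative sketch: nodes (T, D) exchange messages weighted by an
    attention that is gated by the language query embedding q (D,).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.msg = nn.Linear(2 * dim, dim)

    def forward(self, nodes, q, adj):
        # adj: (T, T) adjacency of the temporal sub-graph; assume it
        # includes self-loops so every row has at least one neighbour.
        gated = nodes * torch.sigmoid(self.gate(q))  # condition on query
        attn = torch.softmax(
            (gated @ gated.t()).masked_fill(adj == 0, float("-inf")), dim=-1
        )
        msgs = attn @ nodes  # aggregate neighbour features
        return nodes + torch.relu(self.msg(torch.cat([nodes, msgs], dim=-1)))
```

Stacking a few such rounds yields video features that already reflect the query, which is the "language-conditioned" embedding the summary refers to.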
arXiv Detail & Related papers (2020-10-13T09:50:29Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - VLANet: Video-Language Alignment Network for Weakly-Supervised Video
Moment Retrieval [21.189093631175425]
Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video that is specified by a natural language query.
This paper explores methods for performing VMR in a weakly-supervised manner (wVMR).
The experiments show that the method achieves state-of-the-art performance on Charades-STA and DiDeMo datasets.
arXiv Detail & Related papers (2020-08-24T07:54:59Z) - Document Modeling with Graph Attention Networks for Multi-grained
Machine Reading Comprehension [127.3341842928421]
Natural Questions is a new challenging machine reading comprehension benchmark.
It has two-grained answers: a long answer (typically a paragraph) and a short answer (one or more entities inside the long answer).
Existing methods treat these two sub-tasks individually during training while ignoring their dependencies.
We present a novel multi-grained machine reading comprehension framework that models the hierarchical nature of documents.
arXiv Detail & Related papers (2020-05-12T14:20:09Z)