Video Moment Retrieval from Text Queries via Single Frame Annotation
- URL: http://arxiv.org/abs/2204.09409v1
- Date: Wed, 20 Apr 2022 11:59:17 GMT
- Title: Video Moment Retrieval from Text Queries via Single Frame Annotation
- Authors: Ran Cui, Tianwen Qian, Pai Peng, Elena Daskalaki, Jingjing Chen,
Xiaowei Guo, Huyang Sun, Yu-Gang Jiang
- Abstract summary: Video moment retrieval aims at finding the start and end timestamps of a moment described by a given natural language query.
Fully supervised methods need complete temporal boundary annotations to achieve promising results.
We propose a new paradigm called "glance annotation".
- Score: 65.92224946075693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval aims at finding the start and end timestamps of a
moment (part of a video) described by a given natural language query. Fully
supervised methods need complete temporal boundary annotations to achieve
promising results, which is costly since the annotator needs to watch the whole
moment. Weakly supervised methods only rely on the paired video and query, but
the performance is relatively poor. In this paper, we look closer into the
annotation process and propose a new paradigm called "glance annotation". This
paradigm requires the timestamp of only one single random frame, which we refer
to as a "glance", within the temporal boundary of the fully supervised
counterpart. We argue this is beneficial because, compared to weak supervision,
it adds only trivial annotation cost while offering far more potential in
performance. Under the glance annotation setting, we propose a contrastive
learning based method named Video moment retrieval via Glance Annotation
(ViGA). ViGA cuts the input video into clips and contrasts the clips against
queries, assigning glance-guided, Gaussian-distributed weights to all clips.
Our extensive
experiments indicate that ViGA achieves better results than the
state-of-the-art weakly supervised methods by a large margin, even comparable
to fully supervised methods in some cases.
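To make the glance-guided weighting concrete, the following is a minimal sketch of the idea described above: clips are contrasted against queries, and each clip's contribution to the loss is scaled by a Gaussian centered on the clip that contains the glance. The function names, the temperature tau, and the width sigma are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def glance_gaussian_weights(num_clips: int, glance_clip: int, sigma: float = 1.0) -> torch.Tensor:
    """Gaussian weights over clips, centered at the clip containing the glance frame.

    w_i is proportional to exp(-(i - g)^2 / (2 * sigma^2)), normalized to sum to 1.
    (Illustrative parameterization; the paper's exact form may differ.)
    """
    idx = torch.arange(num_clips, dtype=torch.float32)
    w = torch.exp(-((idx - glance_clip) ** 2) / (2 * sigma ** 2))
    return w / w.sum()

def glance_contrastive_loss(clip_emb: torch.Tensor,      # (num_clips, d) clip embeddings of one video
                            query_emb: torch.Tensor,     # (d,) embedding of the paired query
                            neg_query_emb: torch.Tensor, # (num_neg, d) queries from other videos
                            glance_clip: int,
                            sigma: float = 1.0,
                            tau: float = 0.1) -> torch.Tensor:
    """Clip-to-query InfoNCE loss, weighted by each clip's distance to the glance.

    Every clip treats the paired query as its positive and other videos' queries
    as negatives; the per-clip losses are averaged with Gaussian weights so that
    clips near the glance dominate the objective.
    """
    clip_emb = F.normalize(clip_emb, dim=-1)
    all_queries = F.normalize(torch.cat([query_emb.unsqueeze(0), neg_query_emb], dim=0), dim=-1)
    logits = clip_emb @ all_queries.t() / tau                   # (num_clips, 1 + num_neg)
    targets = torch.zeros(clip_emb.size(0), dtype=torch.long)   # positive query sits at index 0
    per_clip = F.cross_entropy(logits, targets, reduction='none')
    weights = glance_gaussian_weights(clip_emb.size(0), glance_clip, sigma)
    return (weights * per_clip).sum()
```

At inference time, a model trained this way would score clips against the query and derive start and end boundaries from those scores; that step is outside the scope of this sketch.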
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z)
- TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly.
In addition to pre-training on the video and paragraph, our approach can also generalize to the matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z)
- Distill and Collect for Semi-Supervised Temporal Action Segmentation [0.0]
We propose an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences.
Our approach uses multi-stream distillation that repeatedly refines the streams' frame predictions and finally combines them.
Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels to counter the lack of supervision for unannotated videos.
arXiv Detail & Related papers (2022-11-02T17:34:04Z)
- Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval [20.493241098064665]
Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query.
We propose a self-supervised learning framework: the Modal-specific Pseudo Query Generation Network (MPGN).
MPGN generates pseudo queries exploiting both visual and textual information from selected temporal moments.
We show that MPGN successfully learns to localize the video corpus moment without any explicit annotation.
arXiv Detail & Related papers (2022-10-23T05:05:18Z)
- A Generalized & Robust Framework For Timestamp Supervision in Temporal Action Segmentation [79.436224998992]
In temporal action segmentation, timestamp supervision requires only a handful of labelled frames per video sequence.
We propose a novel Expectation-Maximization based approach that leverages the label uncertainty of unlabelled frames.
Our proposed method produces state-of-the-art results and even exceeds the fully supervised setup on several metrics and datasets.
arXiv Detail & Related papers (2022-07-20T18:30:48Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims at retrieving a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
- SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)
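As a complement to the SF-Net entry above, here is a generic, minimal sketch of the single-frame supervision signal itself: a classification loss applied only at the one annotated frame per action instance. It assumes precomputed per-frame features and a plain linear classifier, and it omits the pseudo-labelling and background mining that complete methods such as SF-Net build on top of this signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameClassifier(nn.Module):
    """Per-frame action classifier over precomputed features (illustrative)."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, feat_dim) -> logits: (T, num_classes)
        return self.fc(feats)

def single_frame_loss(logits: torch.Tensor,
                      labelled_frames: torch.Tensor,  # (K,) frame indices, one per annotated instance
                      labels: torch.Tensor            # (K,) action class of each annotated frame
                      ) -> torch.Tensor:
    """Cross-entropy computed only at the annotated single frames.

    All remaining frames receive no direct supervision here; single-frame
    methods add further signals (e.g. mined pseudo-labels) to cover them.
    """
    return F.cross_entropy(logits[labelled_frames], labels)
```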