Multi-Scale Self-Contrastive Learning with Hard Negative Mining for
Weakly-Supervised Query-based Video Grounding
- URL: http://arxiv.org/abs/2203.03838v1
- Date: Tue, 8 Mar 2022 04:01:08 GMT
- Title: Multi-Scale Self-Contrastive Learning with Hard Negative Mining for
Weakly-Supervised Query-based Video Grounding
- Authors: Shentong Mo, Daizong Liu, Wei Hu
- Abstract summary: We propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting.
Firstly, we propose a new grounding scheme that learns frame-wise matching scores with respect to the query semantics to predict the possible foreground frames.
Secondly, since some predicted frames are relatively coarse and appear similar to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm.
- Score: 27.05117092371221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Query-based video grounding is an important yet challenging task in video
understanding, which aims to localize the target segment in an untrimmed video
according to a sentence query. Most previous works achieve significant progress
by addressing this task in a fully-supervised manner with segment-level labels,
which incur a high labeling cost. Although some recent efforts develop
weakly-supervised methods that need only video-level annotations, they
generally match multiple pre-defined segment proposals with the query and select
the best one, which lacks the fine-grained frame-level detail needed to
distinguish highly repetitive and similar frames within the entire video. To
alleviate these limitations, we propose a self-contrastive learning framework
that addresses the query-based video grounding task under a weakly-supervised
setting. Firstly, instead of relying on redundant segment proposals, we propose
a new grounding scheme that learns frame-wise matching scores with respect to
the query semantics to predict the possible foreground frames, using only
video-level annotations. Secondly, since some predicted frames (i.e., boundary
frames) are relatively coarse and appear similar to their adjacent frames, we
propose a coarse-to-fine contrastive learning paradigm that learns more
discriminative frame-wise representations for distinguishing false positive
frames. In particular, we iteratively mine multi-scale hard negative samples
that lie close to positive samples in the representation space, sharpening
fine-grained frame-wise details and thereby enforcing more accurate segment
grounding. Extensive experiments on two challenging benchmarks demonstrate the
superiority of our proposed method over state-of-the-art methods.
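To make the first step concrete, here is a minimal sketch of how a frame-wise matching scheme could be trained from video-level annotations alone. The function names (frame_matching_scores, video_level_loss), the cosine-similarity scoring, the top-k pooling, and the binary cross-entropy objective are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn.functional as F

def frame_matching_scores(frame_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
    """Score each frame against the sentence query via cosine similarity.

    frame_feats: (T, D) per-frame features; query_feat: (D,) pooled query feature.
    Returns (T,) matching scores squashed into (0, 1).
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    return torch.sigmoid(frame_feats @ query_feat)

def video_level_loss(scores: torch.Tensor, is_matched_pair: bool, k: int = 5) -> torch.Tensor:
    """Weak supervision sketch: pool the top-k frame scores into a single
    video-level score and supervise it with the video-level label only
    (1 if the query describes this video, 0 for a mismatched pair)."""
    video_score = scores.topk(min(k, scores.numel())).values.mean()
    target = torch.tensor(float(is_matched_pair))
    return F.binary_cross_entropy(video_score, target)
```

In practice, the mismatched video-query pairs needed for the zero label can be formed by pairing each video with queries sampled from other videos in the batch.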
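The second step, coarse-to-fine contrastive learning with multi-scale hard negatives, might plausibly look like the sketch below: frames just outside the predicted segment boundaries are mined as hard negatives at several temporal offsets, then pushed away from the segment's interior frames with an InfoNCE-style loss. All names, the offset scales (1, 2, 4), and the temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def mine_boundary_negatives(T: int, start: int, end: int, scales=(1, 2, 4)) -> torch.Tensor:
    """Collect background frames just outside the predicted segment
    [start, end) at several temporal offsets; they look similar to the
    boundary frames, which is what makes them 'hard' negatives."""
    idx = set()
    for s in scales:
        if start - s >= 0:
            idx.add(start - s)          # just before the segment
        if end - 1 + s < T:
            idx.add(end - 1 + s)        # just after the segment
    return torch.tensor(sorted(idx), dtype=torch.long)

def hard_negative_infonce(frame_feats, pos_idx, neg_idx, tau=0.07):
    """InfoNCE-style loss: pull the segment's interior (positive) frames
    toward their mean anchor and push the mined hard negatives away."""
    feats = F.normalize(frame_feats, dim=-1)
    anchor = F.normalize(feats[pos_idx].mean(dim=0), dim=0)
    pos_sim = feats[pos_idx] @ anchor / tau   # (P,)
    neg_sim = feats[neg_idx] @ anchor / tau   # (N,)
    all_sim = torch.cat([pos_sim, neg_sim])
    # for each positive: -log( exp(pos) / sum over all positives and negatives )
    return -(pos_sim - torch.logsumexp(all_sim, dim=0)).mean()
```

Iterating this loop (re-predicting the foreground span, then re-mining negatives at progressively finer offsets) matches the coarse-to-fine refinement the abstract describes, though the exact schedule is not specified there.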
Related papers
- Multi-View Video-Based Learning: Leveraging Weak Labels for Frame-Level Perception [1.5741307755393597]
We propose a novel learning framework to train a video-based action recognition model with weak labels for frame-level perception.
For training the model using the weak labels, we propose a novel latent loss function.
We also propose a model that uses the view-specific latent embeddings for downstream frame-level action recognition and detection tasks.
arXiv Detail & Related papers (2024-03-18T09:47:41Z)
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding (VPG) aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches rely heavily on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations [12.139451002212063]
SSVOD exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations.
Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS.
arXiv Detail & Related papers (2023-09-04T06:41:33Z)
- Search-Map-Search: A Frame Selection Paradigm for Action Recognition [21.395733318164393]
Frame selection aims to extract the most informative and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on per-frame importance prediction, or adopt reinforcement learning agents to find representative frames in succession.
We propose a Search-Map-Search learning paradigm which combines the advantages of search and supervised learning to select the best combination of frames from a video as one entity.
arXiv Detail & Related papers (2023-04-20T13:49:53Z)
- Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding [64.99924160432144]
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames.
arXiv Detail & Related papers (2023-01-02T03:38:22Z)
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised video object segmentation is the task of segmenting the target object in a video sequence given only its mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.