Support-Set Based Cross-Supervision for Video Grounding
- URL: http://arxiv.org/abs/2108.10576v1
- Date: Tue, 24 Aug 2021 08:25:26 GMT
- Title: Support-Set Based Cross-Supervision for Video Grounding
- Authors: Xinpeng Ding, Nannan Wang, Shiwei Zhang, De Cheng, Xiaomeng Li, Ziyuan
Huang, Mingqian Tang, Xinbo Gao
- Abstract summary: The Support-set Based Cross-Supervision (Sscs) module improves existing methods during the training phase without extra inference cost.
The proposed Sscs module contains two main components, i.e., a discriminative contrastive objective and a generative caption objective.
We extensively evaluate Sscs on three challenging datasets, and show that our method can improve current state-of-the-art methods by large margins.
- Score: 98.29089558426399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current approaches to video grounding propose various complex architectures
to capture video-text relations and have achieved impressive improvements.
However, complicated multi-modal relations are hard to learn through architecture
design alone. In this paper, we introduce a novel Support-set Based
Cross-Supervision (Sscs) module that improves existing methods during the
training phase without extra inference cost. The proposed Sscs module contains
two main components, i.e., a discriminative contrastive objective and a
generative caption objective. The contrastive objective learns effective
representations via contrastive learning, while the caption objective trains a
powerful video encoder supervised by the texts. Because some visual entities
co-exist in both the ground-truth and background intervals, i.e., they are not
mutually exclusive, naive contrastive learning is unsuitable for video
grounding. We address this problem by boosting the cross-supervision with the
support-set concept, which collects visual information from the whole video and
thereby removes the mutual-exclusion assumption over entities. Combined with
the original objectives, Sscs enhances the multi-modal relation modeling
ability of existing approaches. We extensively evaluate Sscs on three
challenging datasets and show that our method improves current
state-of-the-art methods by large margins, e.g., by 6.35% in terms of R1@0.5 on
Charades-STA.
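As a rough illustration of how the two training-only objectives could be wired up, below is a minimal PyTorch sketch, not the authors' implementation: the module and variable names (SupportSetPool, clip_feats, text_feats, caption_logits, caption_tokens) are illustrative assumptions, and the attention pooling merely stands in for the paper's support-set aggregation over the whole video.

```python
# Minimal sketch (assumptions, not the paper's code) of a support-set based
# cross-supervision loss: a contrastive term between pooled video features and
# sentence features, plus a generative caption term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SupportSetPool(nn.Module):
    """Attention-pool clip-level features over the whole video, so entities
    shared by ground-truth and background intervals all contribute to one
    video-level representation (hypothetical stand-in for the support set)."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, clip_feats):                      # (B, T, D) clip features
        q = self.query(clip_feats)                      # (B, T, D)
        k = self.key(clip_feats)                        # (B, T, D)
        attn = torch.softmax(
            q @ k.transpose(1, 2) / clip_feats.shape[-1] ** 0.5, dim=-1)
        return (attn @ clip_feats).mean(dim=1)          # (B, D) pooled feature


def contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between pooled video features and sentence features."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)   # matched pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


def caption_loss(caption_logits, caption_tokens, pad_id=0):
    """Token-level cross-entropy for the generative caption objective."""
    return F.cross_entropy(caption_logits.flatten(0, 1),    # (B*L, vocab)
                           caption_tokens.flatten(),        # (B*L,)
                           ignore_index=pad_id)
```

In training, the two terms would simply be added to the base grounding loss with weighting coefficients; at inference none of these modules are invoked, which is consistent with the claim that Sscs adds no extra inference cost.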
Related papers
- MV2MAE: Multi-View Video Masked Autoencoders [33.61642891911761]
We present a method for self-supervised learning from synchronized multi-view videos.
We use a cross-view reconstruction task to inject geometry information into the model.
Our approach is based on the masked autoencoder (MAE) framework.
arXiv Detail & Related papers (2024-01-29T05:58:23Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and a modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module able to explicitly predict the edit distance between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z) - Rethinking Multi-Modal Alignment in Video Question Answering from
Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z) - Self-Supervised Video Representation Learning with Motion-Contrastive
Perception [13.860736711747284]
The Motion-Contrastive Perception Network (MCPNet) consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence into one informative "frame".
A valid question is how to define "useful information" and then distill a sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.