A Multi-level Alignment Training Scheme for Video-and-Language Grounding
- URL: http://arxiv.org/abs/2204.10938v2
- Date: Tue, 26 Apr 2022 01:40:19 GMT
- Title: A Multi-level Alignment Training Scheme for Video-and-Language Grounding
- Authors: Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai
- Abstract summary: A good multi-modality encoder should capture both inputs' semantics well and encode them in a shared feature space.
We developed a multi-level alignment training scheme to directly shape the encoding process.
Our framework achieved performance comparable to previous state-of-the-art methods on multiple video QA and retrieval datasets.
- Score: 9.866172676211905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To solve video-and-language grounding tasks, the key is for the network to
understand the connection between the two modalities. For a pair of video and
language description, their semantic relation is reflected by their encodings'
similarity. A good multi-modality encoder should be able to well capture both
inputs' semantics and encode them in the shared feature space where embedding
distance gets properly translated into their semantic similarity. In this work,
we focused on this semantic connection between video and language, and
developed a multi-level alignment training scheme to directly shape the
encoding process. Video-language alignment pairs were designed at the global and
segment levels, covering information ranging from high-level context to
fine-grained semantics. A contrastive loss contrasts the encoding similarities of
positive and negative alignment pairs, training the network so that similar
information is encoded close together in the shared feature space while
information with different semantics is kept apart. Our multi-level alignment
training can be applied to various video-and-language grounding tasks. Together
with the task-specific training loss, our framework achieved performance
comparable to previous state-of-the-art methods on multiple video QA and
retrieval datasets.
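To make the alignment objective concrete, below is a minimal sketch of an InfoNCE-style contrastive alignment loss of the kind the abstract describes, applied to video and text embeddings in a shared feature space: matched pairs in a batch serve as positives and all other pairings as negatives, and the same loss can be instantiated once with global (whole-video / full-description) embeddings and once with segment-level embeddings. The function name, temperature value, symmetric formulation, and loss weights are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of a contrastive alignment loss over
# video/text embeddings in a shared feature space. Row i of each tensor is
# assumed to describe the same video-language pair; all other rows act as
# negatives. Temperature and the symmetric-loss design are assumptions.
import torch
import torch.nn.functional as F


def alignment_loss(video_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings in the shared space."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # cosine similarity of every video-text pairing
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: pull matched pairs together, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# Hypothetical usage: combine global- and segment-level alignment terms with a
# task-specific loss (the weights are placeholders, not values from the paper).
# total_loss = task_loss + alignment_loss(global_v, global_t) \
#                        + alignment_loss(segment_v, segment_t)
```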
Related papers
- Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval [0.0]
Partially Relevant Video Retrieval (PRVR) aims to retrieve a video in which a specific segment is relevant to a given text query. We point out the inherent ambiguity between text and video content based on their conceptual scope. We propose a framework that incorporates this ambiguity into the model learning process.
arXiv Detail & Related papers (2025-06-09T06:44:45Z) - A 2D Semantic-Aware Position Encoding for Vision Transformers [32.86183384267028]
Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. Existing position encoding techniques, largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute and relative position encoding primarily focus on 1D linear positional relationships, often overlooking the semantic similarity between distant yet contextually related patches.
arXiv Detail & Related papers (2025-05-14T15:17:34Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised
Correspondence Learning [74.03651142051656]
We develop LIIR, a locality-aware inter- and intra-video reconstruction framework.
We exploit cross-video affinities as extra negative samples within a unified inter- and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z) - Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning (see the dual-encoder sketch after this list).
arXiv Detail & Related papers (2021-05-13T12:54:39Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
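As a side note on the dual-encoder design mentioned for ReLoCLNet above, the following is a minimal sketch (an assumption for illustration, not the authors' implementation) of why encoding text and video separately is efficient: video embeddings can be pre-computed once, and answering a query reduces to a similarity search, whereas cross-modal interaction models must re-run a joint encoder for every query-video pair.

```python
# Minimal sketch (assumed, not ReLoCLNet's code) of dual-encoder retrieval:
# videos are embedded offline, and a query is ranked against them by cosine similarity.
import torch
import torch.nn.functional as F


def rank_videos(query_emb: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
    """query_emb: (dim,) text embedding; video_embs: (num_videos, dim), pre-computed.
    Returns video indices sorted from most to least similar."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ q                      # cosine similarity to every video
    return torch.argsort(scores, descending=True)
```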