A Multi-level Alignment Training Scheme for Video-and-Language Grounding
- URL: http://arxiv.org/abs/2204.10938v2
- Date: Tue, 26 Apr 2022 01:40:19 GMT
- Title: A Multi-level Alignment Training Scheme for Video-and-Language Grounding
- Authors: Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai
- Abstract summary: A good multi-modality encoder should capture both inputs' semantics well and encode them in a shared feature space.
We developed a multi-level alignment training scheme to directly shape the encoding process.
Our framework achieved performance comparable to previous state-of-the-art methods on multiple video QA and retrieval datasets.
- Score: 9.866172676211905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To solve video-and-language grounding tasks, the key is for the network to
understand the connection between the two modalities. For a pair of video and
language description, their semantic relation is reflected by their encodings'
similarity. A good multi-modality encoder should be able to well capture both
inputs' semantics and encode them in the shared feature space where embedding
distance gets properly translated into their semantic similarity. In this work,
we focused on this semantic connection between video and language, and
developed a multi-level alignment training scheme to directly shape the
encoding process. Video-language alignment pairs were designed at the global and
segment levels, covering information ranging from high-level context to
fine-grained semantics. A contrastive loss contrasts the encoding similarities of
positive and negative alignment pairs, training the network so that similar
information is encoded close together in the shared feature space while
information with different semantics is kept apart. Our multi-level alignment
training can be applied to various video-and-language grounding tasks. Together
with the task-specific training loss, our framework achieved performance
comparable to previous state-of-the-art methods on multiple video QA and
retrieval datasets.
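To make the alignment objective concrete, below is a minimal sketch of an InfoNCE-style contrastive alignment loss of the kind the abstract describes, applied to video and text embeddings in a shared feature space: matched pairs in a batch serve as positives and all other pairings as negatives, and the same loss can be instantiated once with global (whole-video / full-description) embeddings and once with segment-level embeddings. The function name, temperature value, symmetric formulation, and loss weights are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of a contrastive alignment loss over
# video/text embeddings in a shared feature space. Row i of each tensor is
# assumed to describe the same video-language pair; all other rows act as
# negatives. Temperature and the symmetric-loss design are assumptions.
import torch
import torch.nn.functional as F


def alignment_loss(video_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings in the shared space."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # cosine similarity of every video-text pairing
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: pull matched pairs together, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


# Hypothetical usage: combine global- and segment-level alignment terms with a
# task-specific loss (the weights are placeholders, not values from the paper).
# total_loss = task_loss + alignment_loss(global_v, global_t) \
#                        + alignment_loss(segment_v, segment_t)
```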
Related papers
- Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval [0.0]
Partially Relevant Video Retrieval (PRVR) aims to retrieve a video in which a specific segment is relevant to a given text query. We point out the inherent ambiguity between text and video content based on their conceptual scope. We propose a framework that incorporates this ambiguity into the model learning process.
arXiv Detail & Related papers (2025-06-09T06:44:45Z) - A 2D Semantic-Aware Position Encoding for Vision Transformers [32.86183384267028]
Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. Existing position encoding techniques, largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute and relative position encoding primarily focus on 1D linear positional relationships, often overlooking the semantic similarity between distant yet contextually related patches.
arXiv Detail & Related papers (2025-05-14T15:17:34Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised
Correspondence Learning [74.03651142051656]
We develop LIIR, a locality-aware inter- and intra-video reconstruction framework.
We exploit cross-video affinities as extra negative samples within a unified inter- and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z) - Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning (see the dual-encoder sketch after this list).
arXiv Detail & Related papers (2021-05-13T12:54:39Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
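As a side note on the dual-encoder design mentioned for ReLoCLNet above, the following is a minimal sketch (an assumption for illustration, not the authors' implementation) of why encoding text and video separately is efficient: video embeddings can be pre-computed once, and answering a query reduces to a similarity search, whereas cross-modal interaction models must re-run a joint encoder for every query-video pair.

```python
# Minimal sketch (assumed, not ReLoCLNet's code) of dual-encoder retrieval:
# videos are embedded offline, and a query is ranked against them by cosine similarity.
import torch
import torch.nn.functional as F


def rank_videos(query_emb: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
    """query_emb: (dim,) text embedding; video_embs: (num_videos, dim), pre-computed.
    Returns video indices sorted from most to least similar."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ q                      # cosine similarity to every video
    return torch.argsort(scores, descending=True)
```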