Video and Text Matching with Conditioned Embeddings
- URL: http://arxiv.org/abs/2110.11298v1
- Date: Thu, 21 Oct 2021 17:31:50 GMT
- Title: Video and Text Matching with Conditioned Embeddings
- Authors: Ameen Ali, Idan Schwartz, Tamir Hazan, Lior Wolf
- Abstract summary: We present a method for matching a text sentence from a given corpus to a given video clip and vice versa.
In this work, we encode the dataset items in a way that takes the query's relevant information into account.
We show that our conditioned representation can be transferred to video-guided machine translation, where we improved the current results on VATEX.
- Score: 81.81028089100727
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for matching a text sentence from a given corpus to a given video clip, and vice versa. Traditionally, video and text matching is done by learning a shared embedding space, and the encoding of one modality is independent of the other. In this work, we encode the dataset items in a way that takes the query's relevant information into account. The power of the method is demonstrated to arise from pooling the interaction data between words and frames. Since the encoding of the video clip depends on the sentence compared to it, the representation needs to be recomputed for each potential match. To this end, we propose an efficient shallow neural network. Its training employs a hierarchical triplet loss that is extendable to paragraph/video matching. The method is simple, provides explainability, and achieves state-of-the-art results for both sentence-clip and video-text matching by a sizable margin across five different datasets: ActivityNet, DiDeMo, YouCook2, MSR-VTT, and LSMDC. We also show that our conditioned representation can be transferred to video-guided machine translation, where we improved the current results on VATEX. Source code is available at https://github.com/AmeenAli/VideoMatch.
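To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of query-conditioned matching: frame features are pooled through their interactions with the query's words, a shallow network scores the pooled vector, and a margin-based triplet loss compares matching and non-matching sentences. All module names, dimensions, and the exact pooling are assumptions for illustration, not the authors' implementation.
```python
# Hypothetical sketch of query-conditioned video-text matching (assumed
# architecture, not the authors' exact model). The clip representation is
# pooled from word-frame interactions, so it is recomputed per sentence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedMatcher(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Shallow scoring network over the pooled interaction vector.
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frames, words):
        # frames: (T, dim) frame features; words: (L, dim) word features.
        attn = torch.softmax(frames @ words.t() / frames.size(-1) ** 0.5, dim=0)  # (T, L)
        conditioned = (attn.t() @ frames).mean(dim=0)  # pool word-weighted frames -> (dim,)
        return self.score(conditioned).squeeze(-1)     # scalar match score

def triplet_loss(model, frames, pos_words, neg_words, margin=0.2):
    # One level of a hierarchical triplet loss: the clip should score its
    # matching sentence higher than a non-matching one by a margin.
    return F.relu(margin - model(frames, pos_words) + model(frames, neg_words))

# Toy usage with random features.
matcher = ConditionedMatcher()
frames = torch.randn(16, 512)                         # 16 frames
pos, neg = torch.randn(8, 512), torch.randn(10, 512)  # word features of two sentences
loss = triplet_loss(matcher, frames, pos, neg)
```
Because the pooled representation depends on the sentence, the score must be recomputed for every candidate pair, which is why the scoring network is kept shallow.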
Related papers
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [72.59262815400928]
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation.
We come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos.
arXiv Detail & Related papers (2023-12-25T16:37:39Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
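A minimal sketch of the frame-aggregation step described in the entry above, assuming a standard TransformerEncoder with mean pooling; the pre-trained text encoder and the weakly supervised alignment objective are omitted.
```python
# Hedged sketch of the frame aggregation described above: a standard
# TransformerEncoder over frame features, mean-pooled into a video vector.
# The pre-trained text encoder and training objective are not shown.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2)
frame_feats = torch.randn(1, 32, 512)          # (batch, frames, feature dim)
video_repr = encoder(frame_feats).mean(dim=1)  # (batch, 512) video representation
```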
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
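The FineCo entry above suggests a frame-level contrastive objective that favors the frames best aligned with the text. The sketch below is one plausible form (an InfoNCE-style loss with the top-k frames treated as positives), not the paper's exact formulation.
```python
# Plausible frame-level contrastive objective (assumption: InfoNCE over the
# frames of one video, with the top-k frames most similar to the text
# treated as positives). Not the exact FineCo formulation.
import torch
import torch.nn.functional as F

def frame_text_contrastive(frame_emb, text_emb, top_k=4, temp=0.07):
    # frame_emb: (T, dim), text_emb: (dim,)
    sims = F.normalize(frame_emb, dim=-1) @ F.normalize(text_emb, dim=-1)  # (T,)
    logits = sims / temp
    pos = logits.topk(top_k).indices  # frames best aligned with the text
    return -logits.log_softmax(dim=0)[pos].mean()

loss = frame_text_contrastive(torch.randn(20, 512), torch.randn(512))
```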
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
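For the text-adaptive prototype matching described above, a common scoring choice is to take the maximum similarity between the text and any of a video's prototypes; the following sketch assumes that choice and leaves prototype extraction out.
```python
# Sketch of text-adaptive prototype matching (assumption: the video-text
# score is the maximum prototype-text similarity; prototype extraction and
# training are not shown).
import torch
import torch.nn.functional as F

def prototype_score(prototypes, text_emb):
    # prototypes: (K, dim) visual prototypes of one video, text_emb: (dim,)
    sims = F.normalize(prototypes, dim=-1) @ F.normalize(text_emb, dim=-1)  # (K,)
    return sims.max()  # similarity of the best-matching prototype

score = prototype_score(torch.randn(4, 512), torch.randn(512))
```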
- Bi-Calibration Networks for Weakly-Supervised Video Representation Learning [153.54638582696128]
We introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning.
We present Bi-Calibration Networks (BCN), which couple two calibrations to learn the amendment from text to query and vice versa.
BCN learnt on 3M web videos obtains superior results under the linear-model protocol on downstream tasks.
arXiv Detail & Related papers (2022-06-21T16:02:12Z)
- X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [26.581384985173116]
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video.
arXiv Detail & Related papers (2022-03-28T20:47:37Z)
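In the spirit of the X-Pool entry above, the following sketch pools frame features with text-conditioned attention weights and scores the pair by cosine similarity; the single-head formulation and temperature are assumptions.
```python
# Sketch of text-conditioned attention pooling over frames, in the spirit of
# X-Pool (single-head form and temperature are assumptions).
import torch
import torch.nn.functional as F

def text_conditioned_pool(text_emb, frame_emb, temp=0.07):
    # text_emb: (dim,), frame_emb: (T, dim)
    weights = F.softmax(frame_emb @ text_emb / temp, dim=0)  # attention over frames
    video_emb = weights @ frame_emb                          # (dim,) text-conditioned video embedding
    return F.cosine_similarity(video_emb, text_emb, dim=0)   # retrieval similarity

sim = text_conditioned_pool(torch.randn(512), torch.randn(30, 512))
```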
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.