Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
- URL: http://arxiv.org/abs/2004.07967v1
- Date: Thu, 16 Apr 2020 21:12:32 GMT
- Title: Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
- Authors: Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi
- Abstract summary: Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other.
A single space is not enough to accommodate various videos and sentences.
We propose a novel framework that maps instances into multiple individual embedding spaces.
- Score: 8.602553195689513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances because of the difficulty of matching the visual dynamics in videos to the textual features in sentences; a single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce a final similarity between instances by fusing the similarities measured in each embedding space with a weighted-sum strategy, where the weights are determined from the sentence, so the model can flexibly emphasize a particular embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, and the results are competitive with state-of-the-art methods, demonstrating the effectiveness of the proposed multiple-embedding approach compared to existing methods.
Related papers
- Video-Text Retrieval by Supervised Sparse Multi-Grained Learning [22.17732989393653]
We present a novel multi-grained sparse learning framework, S3MA, to learn a sparse space shared between the video and the text for video-text retrieval.
With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses.
Benefiting from the learned shared sparse space and multi-grained similarities, experiments on several video-text retrieval benchmarks demonstrate the superiority of S3MA over existing methods.
arXiv Detail & Related papers (2023-02-19T04:03:22Z) - Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm to further discover the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z) - Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) that explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - Video and Text Matching with Conditioned Embeddings [81.81028089100727]
We present a method for matching a text sentence from a given corpus to a given video clip and vice versa.
In this work, we encode the data in the dataset in a way that takes into account the query's relevant information.
We show that our conditioned representation can be transferred to video-guided machine translation, where we improved the current results on VATEX.
arXiv Detail & Related papers (2021-10-21T17:31:50Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.