SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries
- URL: http://arxiv.org/abs/2011.12091v1
- Date: Tue, 24 Nov 2020 13:54:28 GMT
- Title: SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries
- Authors: Xirong Li and Fangming Zhou and Chaoxi Xu and Jiaqi Ji and Gang Yang
- Abstract summary: Ad-hoc Video Search (AVS) is a core theme in multimedia data management and retrieval.
This paper develops a new and general method for effectively exploiting diverse sentence encoders.
The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, different from prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces.
- Score: 14.230048035478267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search
(AVS), is a core theme in multimedia data management and retrieval. The success
of AVS counts on cross-modal representation learning that encodes both query
sentences and videos into common spaces for semantic similarity computation.
Inspired by the initial success of a few prior works in combining multiple
sentence encoders, this paper takes a step forward by developing a new and
general method for effectively exploiting diverse sentence encoders. The
novelty of the proposed method, which we term Sentence Encoder Assembly (SEA),
is two-fold. First, different from prior art that uses only a single common
space, SEA supports text-video matching in multiple encoder-specific common
spaces. Such a property prevents the matching from being dominated by a
specific encoder that produces an encoding vector much longer than other
encoders. Second, in order to explore complementarities among the individual
common spaces, we propose multi-space multi-loss learning. As extensive
experiments on four benchmarks (MSR-VTT, TRECVID AVS 2016-2019, TGIF and MSVD)
show, SEA surpasses the state-of-the-art. In addition, SEA is extremely easy to
implement. All this makes SEA an appealing solution for AVS and promising for
continuously advancing the task by harvesting new sentence encoders.
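To make the abstract's two ideas concrete, below is a minimal PyTorch sketch, not the authors' implementation: each sentence encoder is matched in its own encoder-specific common space, and a per-space ranking loss supplies the multi-space multi-loss supervision. The class name SEASketch, the triplet-style hinge loss, and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch of SEA-style multi-space matching (illustrative, not the
# authors' code). Each of the K sentence encoders gets its own common space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEASketch(nn.Module):
    def __init__(self, video_dim, query_dims, space_dim=512):
        super().__init__()
        # One video projection per encoder-specific common space.
        self.video_proj = nn.ModuleList(
            nn.Linear(video_dim, space_dim) for _ in query_dims)
        # Each sentence encoding is projected into its own space, so an encoder
        # with a much longer output vector cannot dominate the match.
        self.query_proj = nn.ModuleList(
            nn.Linear(d, space_dim) for d in query_dims)

    def similarity(self, video_feat, query_feats):
        # query_feats: list of K tensors, one per sentence encoder.
        sims = [F.normalize(qp(q), dim=-1) @ F.normalize(vp(video_feat), dim=-1).t()
                for qp, vp, q in zip(self.query_proj, self.video_proj, query_feats)]
        return torch.stack(sims).sum(dim=0)  # assemble the K cosine similarities

    def multi_space_loss(self, video_feat, query_feats, margin=0.2):
        # Multi-space multi-loss learning, assumed here to be a triplet-style
        # ranking loss applied per space and summed; the paper's loss may differ.
        total = 0.0
        for qp, vp, q in zip(self.query_proj, self.video_proj, query_feats):
            sim = F.normalize(qp(q), dim=-1) @ F.normalize(vp(video_feat), dim=-1).t()
            pos = sim.diag().unsqueeze(1)          # matched pairs on the diagonal
            hinge = F.relu(margin + sim - pos)     # violations by negatives
            mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
            total = total + hinge[mask].mean()
        return total

# Usage with two hypothetical sentence encoders of different dimensionalities:
model = SEASketch(video_dim=2048, query_dims=[1024, 768])
videos = torch.randn(8, 2048)                     # precomputed video features
queries = [torch.randn(8, 1024), torch.randn(8, 768)]
model.multi_space_loss(videos, queries).backward()
```

Summing per-space cosine similarities keeps every encoder on an equal footing regardless of its native output dimensionality, which is the property the abstract claims for the encoder-specific spaces.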
Related papers
- Triple-Encoders: Representations That Fire Together, Wire Together [51.15206713482718]
Contrastive Learning is a representation learning method that encodes relative distances between utterances into the embedding space via a bi-encoder.
This study introduces triple-encoders, which efficiently compute distributed utterance mixtures from these independently encoded utterances.
We find that triple-encoders lead to a substantial improvement over bi-encoders, and even to better zero-shot generalization than single-vector representation models.
arXiv Detail & Related papers (2024-02-19T18:06:02Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modal queries.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Learning to Compose Representations of Different Encoder Layers towards Improving Compositional Generalization [29.32436551704417]
We propose CompoSition (Compose Syntactic and Semantic Representations).
CompoSition achieves competitive results on two comprehensive and realistic benchmarks.
arXiv Detail & Related papers (2023-05-20T11:16:59Z)
- UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z)
- Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
arXiv Detail & Related papers (2022-11-21T06:48:14Z)
- Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations [22.40667024030858]
Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient.
Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance.
Trans-Encoder combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders; the bi-/cross-encoder trade-off is illustrated in the sketch after this list.
arXiv Detail & Related papers (2021-09-27T14:06:47Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Although ReLoCLNet encodes text and video separately for efficiency, experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Dual Encoding for Video Retrieval by Text [49.34356217787656]
We propose a dual deep encoding network that encodes videos and queries into powerful dense representations of their own.
Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding.
Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning.
arXiv Detail & Related papers (2020-09-10T15:49:39Z)
- Consistent Multiple Sequence Decoding [36.46573114422263]
We introduce a consistent multiple sequence decoding architecture.
This architecture allows for consistent and simultaneous decoding of an arbitrary number of sequences.
We show the efficacy of our consistent multiple sequence decoder on the task of dense relational image captioning.
arXiv Detail & Related papers (2020-04-02T00:43:54Z)
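The Trans-Encoder entry above contrasts bi- and cross-encoders; the small sketch below illustrates that trade-off using the sentence-transformers library. It is an assumption-laden illustration: the two checkpoint names are arbitrary public models, and nothing here reproduces the Trans-Encoder distillation procedure itself.

```python
# Illustrative bi- vs cross-encoder comparison (an assumption-based sketch,
# not code from any of the papers above). Requires: pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi = SentenceTransformer("all-MiniLM-L6-v2")             # encodes sentences independently
cross = CrossEncoder("cross-encoder/stsb-roberta-base")  # attends over the pair jointly

a = "a dog runs on the beach"
b = "a puppy is running near the sea"

# Bi-encoder: fixed-dimensional vectors, cheap to precompute and index,
# scored here by cosine similarity.
ea, eb = bi.encode([a, b])
bi_score = float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb)))

# Cross-encoder: reads both sentences together, slower but usually stronger.
cross_score = float(cross.predict([(a, b)])[0])

print(f"bi-encoder cosine: {bi_score:.3f}, cross-encoder score: {cross_score:.3f}")
```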