Dual Encoding for Video Retrieval by Text
- URL: http://arxiv.org/abs/2009.05381v2
- Date: Thu, 18 Feb 2021 09:26:20 GMT
- Title: Dual Encoding for Video Retrieval by Text
- Authors: Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang,
Meng Wang
- Abstract summary: We propose a dual deep encoding network that encodes videos and queries into powerful dense representations of their own.
Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding.
Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning.
- Score: 49.34356217787656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper attacks the challenging problem of video retrieval by text. In
such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc
queries described exclusively in the form of a natural-language sentence, with
no visual example provided. Given videos as sequences of frames and queries as
sequences of words, an effective sequence-to-sequence cross-modal matching is
crucial. To that end, the two modalities need to be first encoded into
real-valued vectors and then projected into a common space. In this paper we
achieve this by proposing a dual deep encoding network that encodes videos and
queries into powerful dense representations of their own. Our novelty is
two-fold. First, different from prior art that resorts to a specific
single-level encoder, the proposed network performs multi-level encoding that
represents the rich content of both modalities in a coarse-to-fine fashion.
Second, different from a conventional common space learning algorithm which is
either concept based or latent space based, we introduce hybrid space learning
which combines the high performance of the latent space and the good
interpretability of the concept space. Dual encoding is conceptually simple,
practically effective and end-to-end trained with hybrid space learning.
Extensive experiments on four challenging video datasets show the viability of
the new method.
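As a concrete illustration of the two ideas in the abstract (multi-level encoding and hybrid space learning), the following PyTorch sketch encodes a feature sequence at three levels (global mean pooling, a biGRU, and 1-D convolutions over the GRU outputs) and projects the result into a latent space plus a concept space whose similarities are fused. It is a minimal sketch in the spirit of the abstract, not the authors' released implementation; the layer sizes, the 512-concept vocabulary, and the fusion weight `alpha` are illustrative assumptions.

```python
# Minimal sketch of dual encoding with a hybrid (latent + concept) space.
# Dimensions, the concept vocabulary size, and alpha are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelEncoder(nn.Module):
    """Coarse-to-fine encoding of a feature sequence (video frames or word vectors)."""
    def __init__(self, feat_dim, hidden=512, kernel_sizes=(2, 3, 4), n_filters=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * hidden, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.out_dim = feat_dim + 2 * hidden + n_filters * len(kernel_sizes)

    def forward(self, seq):                      # seq: (batch, time, feat_dim)
        level1 = seq.mean(dim=1)                 # level 1: global mean pooling (coarse)
        gru_out, _ = self.gru(seq)               # level 2: temporal context via biGRU
        level2 = gru_out.mean(dim=1)
        conv_in = gru_out.transpose(1, 2)        # (batch, channels, time) for Conv1d
        level3 = torch.cat(                      # level 3: local patterns via 1-D CNNs (fine)
            [F.relu(conv(conv_in)).max(dim=2).values for conv in self.convs], dim=1)
        return torch.cat([level1, level2, level3], dim=1)

class HybridSpace(nn.Module):
    """Projects an encoding into a latent space plus an interpretable concept space."""
    def __init__(self, in_dim, latent_dim=1536, n_concepts=512):
        super().__init__()
        self.to_latent = nn.Linear(in_dim, latent_dim)
        self.to_concept = nn.Linear(in_dim, n_concepts)

    def forward(self, x):
        latent = F.normalize(self.to_latent(x), dim=-1)
        concepts = torch.sigmoid(self.to_concept(x))      # per-concept probabilities
        return latent, concepts

def hybrid_similarity(txt_lat, txt_con, vid_lat, vid_con, alpha=0.6):
    """Weighted sum of latent-space and concept-space similarities (alpha is assumed)."""
    latent_sim = txt_lat @ vid_lat.t()                                    # cosine (unit vectors)
    concept_sim = F.normalize(txt_con, dim=-1) @ F.normalize(vid_con, dim=-1).t()
    return alpha * latent_sim + (1 - alpha) * concept_sim                 # (n_queries, n_videos)

# Toy usage: 4 videos (30 frames of 2048-d CNN features), 5 queries (12 words of 500-d embeddings).
video_enc, text_enc = MultiLevelEncoder(2048), MultiLevelEncoder(500)
video_space, text_space = HybridSpace(video_enc.out_dim), HybridSpace(text_enc.out_dim)
v_lat, v_con = video_space(video_enc(torch.randn(4, 30, 2048)))
t_lat, t_con = text_space(text_enc(torch.randn(5, 12, 500)))
print(hybrid_similarity(t_lat, t_con, v_lat, v_con).shape)                # torch.Size([5, 4])
```

Because the two branches are independent, the video encodings and the fused similarity can be precomputed once and reused across queries.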
Related papers
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach that explores multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvements on MSR-VTT and DiDeMo, respectively.
arXiv Detail & Related papers (2024-02-26T17:36:50Z) - Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings of texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with that of baselines adopting cross-modal interaction learning.
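The efficiency point above is essentially the standard bi-encoder retrieval pattern: because text and video are encoded independently, the video side can be indexed offline and a new query costs one encoder pass plus a matrix multiply. The sketch below illustrates that generic pattern only, not ReLoCLNet's actual networks; `build_index`, `search`, and the stand-in mean-pooling encoders are hypothetical names.

```python
# Generic bi-encoder retrieval pattern (not ReLoCLNet's networks): index videos offline,
# score a new query with one encoder pass and a single matrix multiply.
import torch
import torch.nn.functional as F

def build_index(video_encoder, videos):
    """Offline: encode and L2-normalize every video in the corpus once."""
    with torch.no_grad():
        return F.normalize(torch.stack([video_encoder(v) for v in videos]), dim=-1)

def search(text_encoder, query, index, k=10):
    """Online: one text-encoder pass, then cosine scores against the precomputed index."""
    with torch.no_grad():
        q = F.normalize(text_encoder(query), dim=-1)
    scores = index @ q                                   # (n_videos,)
    return scores.topk(min(k, index.size(0)))

# Toy usage with stand-in encoders that mean-pool pre-extracted features.
video_encoder = lambda frames: frames.mean(dim=0)        # (T, d) -> (d,)
text_encoder = lambda words: words.mean(dim=0)           # (L, d) -> (d,)
corpus = [torch.randn(30, 256) for _ in range(100)]      # 100 videos, 30 frames each
index = build_index(video_encoder, corpus)
top_scores, top_ids = search(text_encoder, torch.randn(12, 256), index)
print(top_ids)                                           # indices of the 10 best-matching videos
```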
arXiv Detail & Related papers (2021-05-13T12:54:39Z) - Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, in which a pluggable video-to-text retriever is constructed to effectively retrieve sentences from the training corpus as hints.
Our framework coordinates conventional retrieval-based methods with orthodox encoder-decoder methods, so it can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video content.
arXiv Detail & Related papers (2021-03-09T08:17:17Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z) - SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries [14.230048035478267]
Ad-hoc Video Search (AVS) is a core theme in multimedia data management and retrieval.
This paper develops a new and general method for effectively exploiting diverse sentence encoders.
The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, different from prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces.
arXiv Detail & Related papers (2020-11-24T13:54:28Z)
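SEA's central mechanism, matching in several encoder-specific common spaces and combining the per-space scores, can be sketched briefly. The sketch below is illustrative only, not the SEA implementation; the two stand-in sentence-feature dimensionalities and the equal-weight score fusion are assumptions.

```python
# Illustrative sketch of matching in multiple encoder-specific common spaces, in the
# spirit of SEA: each sentence encoder gets its own projection into a space shared with
# a corresponding video projection, and per-space cosine similarities are summed.
# The feature dimensions and equal score weights are stand-in assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderSpecificSpace(nn.Module):
    """One common space tied to one sentence encoder."""
    def __init__(self, text_dim, video_dim, space_dim=1024):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, space_dim)
        self.video_proj = nn.Linear(video_dim, space_dim)

    def similarity(self, text_feat, video_feat):
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        return t @ v.t()                                   # (n_queries, n_videos)

# Two stand-in sentence encoders with different output dimensionalities.
bow_dim, gru_dim, video_dim = 4096, 1024, 2048
spaces = nn.ModuleList([
    EncoderSpecificSpace(bow_dim, video_dim),
    EncoderSpecificSpace(gru_dim, video_dim),
])

bow_feats, gru_feats = torch.randn(5, bow_dim), torch.randn(5, gru_dim)    # 5 queries
video_feats = torch.randn(100, video_dim)                                  # 100 videos

# Final score: sum of similarities computed in each encoder-specific space.
score = spaces[0].similarity(bow_feats, video_feats) + spaces[1].similarity(gru_feats, video_feats)
print(score.shape)                                         # torch.Size([5, 100])
```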