Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising
- URL: http://arxiv.org/abs/2209.08759v1
- Date: Mon, 19 Sep 2022 04:49:51 GMT
- Title: Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising
- Authors: Tan Yu and Jie Liu and Yi Yang and Yi Li and Hongliang Fei and Ping Li
- Abstract summary: How to pair the video ads with the user search is the core task of Baidu video advertising.
Due to the modality gap, the query-to-video retrieval is much more challenging than traditional query-to-document retrieval.
We present a tree-based combo-attention network (TCAN) which has been recently launched in Baidu's dynamic video advertising platform.
- Score: 58.09698019028931
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in communication technology and the popularity of smartphones have fostered the boom of video ads. Baidu, one of the leading search engine companies in the world, receives billions of search queries per day. Pairing video ads with user search queries is the core task of Baidu video advertising. Due to the modality gap, query-to-video retrieval is much more challenging than traditional query-to-document retrieval and image-to-image search. Traditionally, query-to-video retrieval has been tackled as query-to-title retrieval, which is unreliable when the quality of the titles is low. With the rapid progress in computer vision and natural language processing in recent years, content-based search methods have become promising for query-to-video retrieval. Benefiting from pretraining on large-scale datasets, vision BERT methods based on cross-modal attention have achieved excellent performance on many vision-language tasks, both in academia and in industry. Nevertheless, the expensive computation cost of cross-modal attention makes it impractical for large-scale search in industrial applications. In this work, we present a tree-based combo-attention network (TCAN), which has recently been launched in Baidu's dynamic video advertising platform. It provides a practical solution for deploying the heavy cross-modal attention in large-scale query-to-video search. After launching the tree-based combo-attention network, the click-through rate improved by 2.29% and the conversion rate improved by 2.63%.
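The abstract does not spell out how the tree index and the combo-attention scorer interact, but the two-stage recipe it implies (a cheap tree traversal that prunes the candidate videos before the expensive cross-modal attention runs) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration in Python, not the paper's implementation: the two-level k-means tree, the beam width, and the placeholder `cross_modal_score` callback are hypothetical stand-ins for Baidu's actual index and Text-Vision BERT scorer.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows (or a single vector) so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

def build_two_level_tree(video_embs: np.ndarray, n_clusters: int, iters: int = 10):
    """Crude k-means producing (centroids, member lists) for a two-level tree (a sketch only)."""
    rng = np.random.default_rng(0)
    centroids = video_embs[rng.choice(len(video_embs), n_clusters, replace=False)]
    for _ in range(iters):
        sims = normalize(video_embs) @ normalize(centroids).T   # cheap assignment step
        assign = sims.argmax(axis=1)
        for c in range(n_clusters):
            mask = assign == c
            if mask.any():
                centroids[c] = video_embs[mask].mean(axis=0)
    members = [np.where(assign == c)[0] for c in range(n_clusters)]
    return normalize(centroids), members

def retrieve(query_emb, centroids, members, video_embs,
             beam=4, top_k=10, cross_modal_score=None):
    """Cheap tree traversal to prune candidates, then expensive re-ranking on the survivors."""
    q = normalize(query_emb)
    cluster_sims = centroids @ q                 # stage 1: score cluster centroids only
    kept = np.argsort(-cluster_sims)[:beam]      # keep the `beam` closest clusters
    candidates = np.concatenate([members[c] for c in kept])
    # Stage 2: run the heavy scorer (standing in for the cross-modal attention model)
    # on the small surviving candidate set instead of the full corpus.
    scores = np.array([cross_modal_score(query_emb, video_embs[v]) for v in candidates])
    order = np.argsort(-scores)[:top_k]
    return candidates[order], scores[order]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    videos = rng.normal(size=(10_000, 64)).astype(np.float32)   # toy video embeddings
    query = rng.normal(size=64).astype(np.float32)               # toy query embedding
    centroids, members = build_two_level_tree(videos, n_clusters=100)
    # Placeholder scorer; in practice this would be the cross-modal attention model.
    dot = lambda q, v: float(normalize(q) @ normalize(v))
    top_ids, top_scores = retrieve(query, centroids, members, videos,
                                   cross_modal_score=dot)
    print(top_ids, top_scores)
```

In this sketch, stage one touches only a hundred centroids per query, so the heavy scorer sees a few hundred candidates instead of the full corpus; that division of labor is what makes cross-modal attention affordable at search-engine scale.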
Related papers
- ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising [2.330164376631038]
Contextual advertising serves ads that are aligned to the content that the user is viewing.
Current text-to-video retrieval models based on joint multimodal training demand large datasets and computational resources.
We introduce ContextIQ, a multimodal expert-based video retrieval system designed specifically for contextual advertising.
arXiv Detail & Related papers (2024-10-29T17:01:05Z)
- MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual event-centric video retrieval benchmark.
It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events.
Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z)
- T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval [30.48217069475297]
We introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers.
T2VIndexer aims to reduce retrieval time while maintaining high accuracy.
arXiv Detail & Related papers (2024-08-21T08:40:45Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Zero-shot Audio Topic Reranking using Large Language Models [42.774019015099704]
Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval.
This work aims to compensate for any performance loss from this rapid archive search by examining reranking approaches.
Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus.
arXiv Detail & Related papers (2023-09-14T11:13:36Z)
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)