CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short
Video Search Scenarios
- URL: http://arxiv.org/abs/2401.10475v2
- Date: Thu, 25 Jan 2024 06:58:17 GMT
- Title: CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short
Video Search Scenarios
- Authors: Xiangshuo Qiao, Xianxin Li, Xiaozhe Qu, Jie Zhang, Yang Liu, Yu Luo,
Cihang Jin, Jin Ma
- Abstract summary: Vision-Language Models pre-trained on large-scale image-text datasets have shown superior performance in downstream tasks such as image retrieval.
We establish the first large-scale cover-text benchmark for Chinese short video search scenarios.
UniCLIP has been deployed to Tencent's online video search systems with hundreds of millions of visits and achieved significant gains.
- Score: 15.058793892803008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models pre-trained on large-scale image-text
datasets have shown superior performance in downstream tasks such as image
retrieval. Most pre-training images depict open-domain, common-sense visual
elements. In contrast, video covers in short video search scenarios are
user-originated content that provides an important visual summary of the
video. In addition, some video covers come with manually designed cover texts
that provide semantic complements. To fill this gap in short video cover data,
we establish the first large-scale cover-text benchmark for Chinese short
video search scenarios. Specifically, we release two large-scale datasets,
CBVS-5M/10M, to provide short video covers, and the manually fine-labeled
dataset CBVS-20K to provide real user queries, which serves as an image-text
benchmark for the Chinese short video search field. To integrate the semantics
of cover text when that modality is missing, we propose UniCLIP, in which
cover texts play a guiding role during training but are not relied upon at
inference. Extensive evaluation on CBVS-20K demonstrates the excellent
performance of our proposal. UniCLIP has been deployed to Tencent's online
video search systems with hundreds of millions of visits and has achieved
significant gains. The dataset and code are available at
https://github.com/QQBrowserVideoSearch/CBVS-UniCLIP.
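To make the modality-missing idea above concrete, the sketch below shows a
generic CLIP-style dual encoder in which a cover-text branch contributes only
an auxiliary alignment loss during training, so inference needs nothing but
the cover image and the user query. It is a minimal PyTorch illustration with
assumed placeholder components (linear/EmbeddingBag encoder heads, feature
sizes, unweighted loss terms), not the authors' actual UniCLIP architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoverGuidedDualEncoder(nn.Module):
        """Toy image-query dual encoder with a training-only cover-text branch."""

        def __init__(self, dim=256, vocab=30000, img_feat=2048):
            super().__init__()
            # Placeholder heads; a production system would use CLIP/BERT backbones.
            self.img_proj = nn.Linear(img_feat, dim)      # video-cover image head
            self.query_emb = nn.EmbeddingBag(vocab, dim)  # user-query text encoder
            self.cover_emb = nn.EmbeddingBag(vocab, dim)  # cover-text encoder (training only)
            self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

        def encode_image(self, img_feats):
            return F.normalize(self.img_proj(img_feats), dim=-1)

        def encode_query(self, query_ids):
            return F.normalize(self.query_emb(query_ids), dim=-1)

        def forward(self, img_feats, query_ids, cover_ids=None):
            v = self.encode_image(img_feats)   # (batch, dim)
            q = self.encode_query(query_ids)   # (batch, dim)
            scale = self.logit_scale.exp()
            logits = scale * q @ v.t()         # query-to-cover similarities
            targets = torch.arange(v.size(0), device=logits.device)
            loss = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.t(), targets)) / 2
            # The cover text only adds an extra alignment term during training;
            # the retrieval path above never consumes it, so inference still
            # works when the cover text is absent.
            if self.training and cover_ids is not None:
                c = F.normalize(self.cover_emb(cover_ids), dim=-1)
                loss = loss + F.cross_entropy(scale * c @ v.t(), targets)
            return loss

    # At inference, only covers and queries are embedded:
    #   model.eval()
    #   scores = model.encode_query(query_ids) @ model.encode_image(img_feats).t()

Keeping the cover-text encoder out of the retrieval path is what lets the same
model rank covers that ship without any overlay text.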
Related papers
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval [56.05621657583251]
Cross-modal (e.g. image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding.
We introduce RTime, a novel temporal-emphasized video-text retrieval dataset.
Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours.
arXiv Detail & Related papers (2024-12-26T11:32:00Z)
- MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual event-centric video retrieval benchmark.
It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events.
Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in recall@K=1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval [41.13561065438316]
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video, which could range anywhere from moment-by-moment detail to a single phrase summary.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
arXiv Detail & Related papers (2023-11-30T18:59:45Z)
- Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture.
In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection.
In the second stage, we propose a novel decoupled video text cross attention module to capture fine-grained multimodal information in spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)