A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text
Spotter with Transformer
- URL: http://arxiv.org/abs/2112.04888v1
- Date: Thu, 9 Dec 2021 13:21:26 GMT
- Title: A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text
Spotter with Transformer
- Authors: Weijia Wu, Yuanqiang Cai, Debing Zhang, Sibo Wang, Zhuang Li, Jiahong
Li, Yejun Tang, Hong Zhou
- Abstract summary: We introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText).
Firstly, we provide 2,000+ videos with more than 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos.
Secondly, our dataset covers 30+ open categories with a wide selection of various scenarios, e.g., Life Vlog, Driving, Movie, etc.
- Score: 12.167938646139705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing video text spotting benchmarks focus on evaluating a single
language and scenario with limited data. In this work, we introduce a
large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). There
are four features for BOVText. Firstly, we provide 2,000+ videos with more than
1,750,000 frames, 25 times larger than the existing largest dataset with
incidental text in videos. Secondly, our dataset covers 30+ open categories
with a wide selection of various scenarios, e.g., Life Vlog, Driving, Movie,
etc. Thirdly, abundant text type annotations (i.e., title, caption, or scene
text) are provided for the different representational meanings in video.
Fourthly, the BOVText provides bilingual text annotation to promote exchange and
communication across multiple cultures. Besides, we propose an end-to-end video text
spotting framework with Transformer, termed TransVTSpotter, which solves the
multi-oriented text spotting problem in video with a simple but efficient
attention-based query-key mechanism. It applies object features from the
previous frame as a tracking query for the current frame and introduces a
rotation angle prediction to fit multi-oriented text instances. On
ICDAR2015 (video), TransVTSpotter achieves state-of-the-art performance with
44.1% MOTA at 9 fps. The dataset and code of TransVTSpotter can be found at
github.com/weijiawu/BOVText and github.com/weijiawu/TransVTSpotter,
respectively.
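As summarized in the abstract, TransVTSpotter propagates object features from the previous frame as tracking queries for the current frame and adds a rotation-angle prediction for multi-oriented text. The PyTorch-style sketch below illustrates only that query-propagation-plus-angle-head pattern; the module names, query counts, and head sizes are our own illustrative assumptions and are not taken from the released TransVTSpotter code.
```python
# Illustrative sketch of a tracking-query video text spotter (not the released code).
import torch
import torch.nn as nn


class TrackingQuerySpotter(nn.Module):
    def __init__(self, d_model=256, num_queries=100, nhead=8, num_layers=6):
        super().__init__()
        # Stand-in for a real backbone: patchify each frame into feature tokens.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.detect_queries = nn.Embedding(num_queries, d_model)  # queries for newly appearing text
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized
        self.angle_head = nn.Linear(d_model, 1)  # rotation angle for multi-oriented text
        self.score_head = nn.Linear(d_model, 2)  # text / background

    def forward(self, frame, prev_track_queries=None):
        # frame: (B, 3, H, W); prev_track_queries: (B, N, d_model) from the previous frame.
        memory = self.backbone(frame).flatten(2).transpose(1, 2)  # (B, tokens, d_model)
        detect = self.detect_queries.weight.unsqueeze(0).expand(frame.size(0), -1, -1)
        # Attention-based query-key matching: previous-frame features act as
        # tracking queries so the same text instance keeps its identity across frames.
        queries = detect if prev_track_queries is None else torch.cat(
            [prev_track_queries, detect], dim=1)
        feats = self.decoder(queries, memory)
        outputs = {
            "boxes": self.box_head(feats).sigmoid(),
            "angles": self.angle_head(feats),
            "scores": self.score_head(feats).softmax(-1),
        }
        # In practice only confident queries would be kept as next-frame tracking queries.
        return outputs, feats


# Usage: run frame by frame, feeding query features back in as tracking queries.
model = TrackingQuerySpotter()
clip = torch.randn(4, 1, 3, 256, 256)  # (T, B, C, H, W) dummy video clip
track_queries = None
for frame in clip:
    outputs, track_queries = model(frame, track_queries)
```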
Related papers
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Based on Vript, the authors train Vriptor, a model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [72.59262815400928]
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation.
We come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos.
arXiv Detail & Related papers (2023-12-25T16:37:39Z)
- DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text [46.177941541282756]
We establish a video text reading benchmark, named DSText V2, which focuses on Dense and Small text reading challenges in the video.
Compared with the previous datasets, the proposed dataset mainly includes three new challenges.
A high proportion of small texts, coupled with blurriness and distortion in the video, brings further challenges.
arXiv Detail & Related papers (2023-11-29T09:13:27Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- ICDAR 2023 Video Text Reading Competition for Dense and Small Text [61.138557702185274]
We establish a video text reading benchmark, DSText, which focuses on dense and small text reading challenges in the video.
Compared with the previous datasets, the proposed dataset mainly includes three new challenges.
The proposed DSText includes 100 video clips from 12 open scenarios, supporting two tasks (i.e., video text tracking (Task 1) and end-to-end video text spotting (Task 2)).
arXiv Detail & Related papers (2023-04-10T04:20:34Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, our proposed network architecture is trained by following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Bi-Calibration Networks for Weakly-Supervised Video Representation Learning [153.54638582696128]
We introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning.
We present Bi-Calibration Networks (BCN) that novelly couples two calibrations to learn the amendment from text to query and vice versa.
BCN learnt on 3M web videos obtains superior results under the linear model protocol on downstream tasks.
arXiv Detail & Related papers (2022-06-21T16:02:12Z)
- X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [26.581384985173116]
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video.
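X-Pool's core idea, per the summary above, is to let the text attend over a video's frames rather than pooling them uniformly. Below is a minimal sketch of such text-conditioned attention pooling; the projection layers, dimensions, and names are illustrative assumptions rather than the paper's exact architecture.
```python
# Minimal sketch of text-conditioned attention pooling over frame embeddings,
# in the spirit of X-Pool. Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn


class TextConditionedPooling(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # text -> query
        self.k_proj = nn.Linear(d_model, d_model)  # frames -> keys
        self.v_proj = nn.Linear(d_model, d_model)  # frames -> values
        self.scale = d_model ** -0.5

    def forward(self, text_emb, frame_embs):
        # text_emb: (B, d); frame_embs: (B, T, d) per-frame features (e.g. from an image encoder)
        q = self.q_proj(text_emb).unsqueeze(1)  # (B, 1, d)
        k = self.k_proj(frame_embs)             # (B, T, d)
        v = self.v_proj(frame_embs)             # (B, T, d)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, T)
        return (attn @ v).squeeze(1)            # text-conditioned video embedding (B, d)


# Retrieval similarity: cosine between the text and its conditioned video embedding.
pool = TextConditionedPooling()
text = torch.randn(2, 512)
frames = torch.randn(2, 8, 512)
video = pool(text, frames)
sim = torch.cosine_similarity(text, video, dim=-1)  # (B,)
```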
arXiv Detail & Related papers (2022-03-28T20:47:37Z)
- CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval [14.022356429411934]
We present CLIP2TV, aiming at exploring where the critical elements lie in transformer based methods.
Notably, CLIP2TV achieves 52.9@R1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1% (a generic sketch of how Recall@K is computed appears after this list).
arXiv Detail & Related papers (2021-11-10T10:05:11Z)
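Several of the retrieval papers above (TextVR, X-Pool, CLIP2TV) report results as Recall@K, e.g. CLIP2TV's 52.9@R1 on MSR-VTT. The snippet below is a generic sketch of how R@K is computed from a text-to-video similarity matrix; it is illustrative only and not taken from any of these papers' evaluation code.
```python
# Generic sketch: Recall@K for text-to-video retrieval from a similarity matrix.
# Assumes the i-th text matches the i-th video (the usual benchmark convention).
import torch


def recall_at_k(sim: torch.Tensor, k: int = 1) -> float:
    # sim: (num_texts, num_videos) similarity scores, ground truth on the diagonal
    ranks = sim.argsort(dim=1, descending=True)       # best-matching videos first
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # correct video index per text
    hits = (ranks[:, :k] == targets).any(dim=1)       # correct video within top-k?
    return hits.float().mean().item()


sim = torch.randn(1000, 1000)  # e.g. the MSR-VTT 1k-A test split has 1,000 text-video pairs
print(f"R@1 = {recall_at_k(sim, 1) * 100:.1f}%")
```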
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.