Multi-language Video Subtitle Dataset for Image-based Text Recognition
- URL: http://arxiv.org/abs/2411.05043v1
- Date: Thu, 07 Nov 2024 00:06:53 GMT
- Title: Multi-language Video Subtitle Dataset for Image-based Text Recognition
- Authors: Thanadol Singkhornart, Olarik Surinta
- Abstract summary: This dataset includes 4,224 subtitle images extracted from 24 videos sourced from online platforms.
It features a wide variety of characters, including Thai consonants, vowels, tone marks, punctuation marks, numerals, Roman characters, and Arabic numerals.
- Score: 0.0
- Abstract: The Multi-language Video Subtitle Dataset is a comprehensive collection designed to support research in text recognition across multiple languages. This dataset includes 4,224 subtitle images extracted from 24 videos sourced from online platforms. It features a wide variety of characters, including Thai consonants, vowels, tone marks, punctuation marks, numerals, Roman characters, and Arabic numerals. With 157 unique characters, the dataset provides a resource for addressing challenges in text recognition within complex backgrounds. It addresses the growing need for high-quality, multilingual text recognition data, particularly as videos with embedded subtitles become increasingly dominant on platforms like YouTube and Facebook. The variability in text length, font, and placement within these images adds complexity, offering a valuable resource for developing and evaluating deep learning models. The dataset facilitates accurate text transcription from video content while providing a foundation for improving computational efficiency in text recognition systems. As a result, it holds significant potential to drive advancements in research and innovation across various computer science disciplines, including artificial intelligence, deep learning, computer vision, and pattern recognition.
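To make the dataset description concrete, below is a minimal Python sketch of how such a collection might be prepared for a text-recognition model: it builds a character-to-index vocabulary (which should recover roughly the 157 unique characters mentioned in the abstract) and encodes variable-length transcriptions for a CTC-style decoder. The folder name `multilang_subtitle_dataset` and the tab-separated `labels.tsv` annotation file are assumptions made for illustration; the dataset's actual distribution format may differ.

```python
# Minimal sketch, assuming a hypothetical layout of an `images/` folder plus a
# tab-separated `labels.tsv` of "filename<TAB>transcription" lines; the real
# dataset's annotation format may differ.
from pathlib import Path

DATA_DIR = Path("multilang_subtitle_dataset")   # hypothetical dataset root
LABELS_FILE = DATA_DIR / "labels.tsv"           # hypothetical annotation file

def load_labels(path: Path) -> dict[str, str]:
    """Read filename -> transcription pairs from a tab-separated file."""
    labels = {}
    with path.open(encoding="utf-8") as f:
        for line in f:
            name, _, text = line.rstrip("\n").partition("\t")
            if name and text:
                labels[name] = text
    return labels

def build_charset(transcriptions) -> dict[str, int]:
    """Collect every unique character (Thai, Roman, digits, punctuation, ...)
    and map it to an integer id; id 0 is reserved for the CTC blank token."""
    chars = sorted({ch for text in transcriptions for ch in text})
    return {ch: idx + 1 for idx, ch in enumerate(chars)}

if __name__ == "__main__":
    labels = load_labels(LABELS_FILE)
    charset = build_charset(labels.values())
    print(f"{len(labels)} labelled subtitle images, "
          f"{len(charset)} unique characters (the abstract reports 157)")
    # Transcriptions become variable-length id sequences, which is why
    # CTC- or attention-based decoders are the usual fit for this data.
    example = next(iter(labels.values()))
    print([charset[ch] for ch in example])
```

Reserving index 0 for the blank token follows the common convention for CRNN/CTC-style recognizers, which handle the variable text lengths the abstract highlights.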
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus of language-video tasks result in limited capacity for real-world natural language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- CelebV-Text: A Large-Scale Facial Text-Video Dataset [91.22496444328151]
CelebV-Text is a large-scale, diverse, and high-quality dataset of facial text-video pairs.
CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy.
The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance.
arXiv Detail & Related papers (2023-03-26T13:06:35Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind captioning in English.
arXiv Detail & Related papers (2022-02-11T06:29:25Z)
- Towards Boosting the Accuracy of Non-Latin Scene Text Recognition [27.609596088151644]
Scene-text recognition performs remarkably better on Latin languages than on non-Latin languages.
This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages.
arXiv Detail & Related papers (2022-01-10T06:36:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.