Cursive Caption Text Detection in Videos
- URL: http://arxiv.org/abs/2301.03164v1
- Date: Mon, 9 Jan 2023 04:30:48 GMT
- Title: Cursive Caption Text Detection in Videos
- Authors: Ali Mirza, Imran Siddiqi
- Abstract summary: This paper presents a robust technique for detection of textual content appearing in video frames.
We target text in cursive script taking Urdu text as a case study.
Since it is common for videos to carry caption text in multiple scripts, cursive text is distinguished from Latin text using a script-identification module.
- Score: 5.117030416610515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Textual content appearing in videos represents an interesting index for
semantic retrieval of videos (from archives), generation of alerts (live
streams), as well as high-level applications like opinion mining and content
summarization. One of the key components of such systems is the detection of
textual content in video frames, which is the subject of the present study.
This paper presents a robust technique for detecting textual content appearing
in video frames. More specifically, we target text in cursive script, taking
Urdu text as a case study. Detection of textual regions in video frames is
carried out by fine-tuning object detectors based on deep convolutional neural
networks for the specific case of text detection. Since it is common to have
videos with caption text in multiple scripts, cursive text is distinguished
from Latin text using a script-identification module. Finally, detection and
script identification are combined in a single end-to-end trainable system.
Experiments on a comprehensive dataset of around 11,000 video frames report an
F-measure of 0.91.
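To make the pipeline concrete, below is a minimal sketch of the detection-plus-script-identification idea, assuming a torchvision Faster R-CNN backbone; the detector choice, class set, and tensor shapes are illustrative assumptions, not the paper's exact configuration. Treating cursive (Urdu) and Latin text as two foreground classes lets one network learn detection and script identification end to end.

```python
# Sketch: fine-tune a pre-trained object detector for caption text detection,
# with cursive vs. Latin text as separate classes (assumed setup, see above).
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 3  # background, cursive (Urdu) text, Latin text

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
# Swap the classification head so detection and script identification are
# learned jointly in one end-to-end trainable network.
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

model.train()
images = [torch.rand(3, 360, 640)]  # one video frame (C, H, W), values in [0, 1]
targets = [{
    "boxes": torch.tensor([[40.0, 300.0, 600.0, 350.0]]),  # a caption region
    "labels": torch.tensor([1]),                           # 1 = cursive text
}]
loss_dict = model(images, targets)  # detector returns its training losses
sum(loss_dict.values()).backward()
```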
Related papers
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
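As a rough illustration of the second point, syntax-guided similarity can be thought of as re-weighting per-word matching scores by each word's place in the syntax hierarchy; the weights and scores below are invented placeholders, not SHE-Net's actual mechanism.

```python
import torch

# Hypothetical per-word similarities between a query and a video.
word_sims = torch.tensor([0.20, 0.85, 0.60, 0.10])
# Hypothetical weights from a syntax hierarchy (e.g., verb/object emphasized).
syntax_w = torch.tensor([0.10, 0.45, 0.35, 0.10])
score = (syntax_w * word_sims).sum()  # syntax-guided text-video similarity
```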
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video.
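A toy version of word-frame matching is sketched below; plain pairwise similarities with max-pooling stand in for the paper's n-ary hypergraph correlations, and all embeddings are random placeholders.

```python
import torch
import torch.nn.functional as F

word_embs = F.normalize(torch.randn(7, 256), dim=-1)    # words of a query chunk
frame_embs = F.normalize(torch.randn(32, 256), dim=-1)  # sampled video frames
sim = word_embs @ frame_embs.T  # (7, 32) word-frame cosine similarities
# Each word keeps its best-matching frame; averaging yields a chunk-video score.
score = sim.max(dim=1).values.mean()
```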
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
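A minimal sketch of the moment-retrieval component follows: per-frame fused text-video features are mapped to start/end boundary distributions. The single linear head is an assumption for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    """Predict start/end boundary probabilities from fused per-frame features."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, 2)  # one logit each for start and end

    def forward(self, fused):          # fused: (T, dim), T = number of frames
        logits = self.proj(fused)      # (T, 2)
        return logits[:, 0].softmax(0), logits[:, 1].softmax(0)

start_p, end_p = BoundaryHead()(torch.randn(100, 512))
moment = (int(start_p.argmax()), int(end_p.argmax()))  # predicted boundaries
```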
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
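One simple way to exploit such generated captions at retrieval time, sketched under assumed embeddings and an assumed fusion weight alpha, is to blend query-video similarity with query-caption similarity:

```python
import torch
import torch.nn.functional as F

q = F.normalize(torch.randn(512), dim=-1)  # query text embedding
v = F.normalize(torch.randn(512), dim=-1)  # video embedding
c = F.normalize(torch.randn(512), dim=-1)  # embedding of a generated caption
alpha = 0.5                                # fusion weight (assumed)
score = torch.dot(q, v) + alpha * torch.dot(q, c)
```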
arXiv Detail & Related papers (2022-12-31T11:50:32Z)
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
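The prototype idea can be sketched in a few lines: a video is summarized by several prototype vectors and a query is scored against its best match. The max-over-prototypes rule here is an illustrative simplification.

```python
import torch
import torch.nn.functional as F

prototypes = F.normalize(torch.randn(4, 512), dim=-1)  # 4 visual prototypes
text = F.normalize(torch.randn(512), dim=-1)           # query embedding
score = (prototypes @ text).max()  # text matches its closest prototype
```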
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
- Contextual Text Block Detection towards Scene Text Understanding [85.40898487745272]
This paper presents contextual text detection, a new setup that detects contextual text blocks (CTBs) for better understanding of texts in scenes.
We formulate the new setup as a dual detection task, which first detects integral text units and then groups them into a CTB.
To this end, we design a novel scene text clustering technique that treats integral text units as tokens and groups them (belonging to the same CTB) into an ordered token sequence.
arXiv Detail & Related papers (2022-07-26T14:59:25Z)
- Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z)
- X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [26.581384985173116]
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video.
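The core idea can be sketched as text-conditioned attention pooling over frames, where frames relevant to the query dominate the aggregated video embedding; the single-head formulation and dimensions below are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def text_conditioned_pool(text_emb, frame_embs):
    """text_emb: (D,), frame_embs: (T, D) -> pooled video embedding (D,)."""
    scale = text_emb.shape[-1] ** 0.5
    attn = F.softmax(frame_embs @ text_emb / scale, dim=0)  # (T,) frame weights
    return attn @ frame_embs

text = torch.randn(512)
frames = torch.randn(16, 512)  # 16 sampled frame embeddings
video_vec = text_conditioned_pool(text, frames)
similarity = F.cosine_similarity(text, video_vec, dim=0)
```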
arXiv Detail & Related papers (2022-03-28T20:47:37Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Video Text Tracking With a Spatio-Temporal Complementary Model [46.99051486905713]
Text tracking aims to track multiple text instances in a video and construct a trajectory for each text.
Existing methods handle this task by utilizing the tracking-by-detection framework.
We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios.
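For reference, a toy version of the tracking-by-detection paradigm the paper critiques: detections in each frame are greedily linked to the trajectory whose last box overlaps them most. The greedy rule and threshold are illustrative assumptions.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def link(trajectories, detections, thr=0.5):
    for det in detections:  # one frame's text detections
        best = max(trajectories, key=lambda t: iou(t[-1], det), default=None)
        if best is not None and iou(best[-1], det) >= thr:
            best.append(det)            # extend an existing text trajectory
        else:
            trajectories.append([det])  # start a new trajectory
    return trajectories
```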
arXiv Detail & Related papers (2021-11-09T08:23:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.