Contrastive Learning of Semantic and Visual Representations for Text
Tracking
- URL: http://arxiv.org/abs/2112.14976v1
- Date: Thu, 30 Dec 2021 09:22:13 GMT
- Title: Contrastive Learning of Semantic and Visual Representations for Text
Tracking
- Authors: Zhuang Li, Weijia Wu, Mike Zheng Shou, Jiahong Li, Size Li, Zhongyuan
Wang, Hong Zhou
- Abstract summary: In this paper, we explore how to robustly track video text with contrastive learning of semantic and visual representations.
We present an end-to-end video text tracker with Semantic and Visual Representations (SVRep), which detects and tracks texts by exploiting the visual and semantic relationships between different texts in a video sequence.
With a ResNet-18 backbone, SVRep achieves an ${\rm ID_{F1}}$ of $\textbf{65.9\%}$, running at $\textbf{16.7}$ FPS, on the ICDAR2015 (video) dataset.
- Score: 22.817884815010856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic representation is of great benefit to the video text tracking (VTT)
task, which requires simultaneously classifying, detecting, and tracking texts in
a video. Most existing approaches tackle this task using appearance similarity
across consecutive frames while ignoring the abundant semantic features. In this
paper, we explore how to robustly track video text with contrastive learning of
semantic and visual representations. Correspondingly, we present an end-to-end
video text tracker with Semantic and Visual Representations (SVRep), which
detects and tracks texts by exploiting the visual and semantic relationships
between different texts in a video sequence. Moreover, with a lightweight
architecture, SVRep achieves state-of-the-art performance while maintaining
competitive inference speed. Specifically, with a ResNet-18 backbone, SVRep
achieves an ${\rm ID_{F1}}$ of $\textbf{65.9\%}$, running at $\textbf{16.7}$
FPS, on the ICDAR2015 (video) dataset, an $\textbf{8.6\%}$ improvement over the
previous state-of-the-art method.
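As a hedged illustration of the contrastive objective described in the abstract, the sketch below pairs fused visual and semantic embeddings of the same text instance across consecutive frames as positives and treats all other instances as negatives, in an InfoNCE style. The `fuse` helper, concatenation-based fusion, and temperature value are assumptions for illustration, not the authors' published implementation.

```python
# Minimal InfoNCE-style sketch of contrastive learning over fused
# visual + semantic embeddings of text instances; a generic stand-in,
# NOT SVRep's actual loss. fuse() and temperature are assumptions.
import torch
import torch.nn.functional as F

def fuse(visual_emb: torch.Tensor, semantic_emb: torch.Tensor) -> torch.Tensor:
    """Fuse per-instance visual and semantic features (concatenation here)."""
    return F.normalize(torch.cat([visual_emb, semantic_emb], dim=-1), dim=-1)

def contrastive_tracking_loss(emb_t: torch.Tensor,
                              emb_t1: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over text instances in two consecutive frames.

    emb_t, emb_t1: (N, D) L2-normalized fused embeddings; row i in both
    frames is the same text instance (positive pair), every other row in
    the opposite frame serves as a negative.
    """
    logits = emb_t @ emb_t1.T / temperature                     # (N, N) similarities
    targets = torch.arange(emb_t.size(0), device=emb_t.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)
```

In a tracker, the learned similarities would then drive the association of text instances across frames.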
Related papers
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net).
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing (a joint-loss sketch follows this entry).
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
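To make the multi-task setup concrete, here is a minimal sketch of a joint objective in the spirit of the summary above: classification, segmentation, and recognition losses computed on shared features and summed, so all branches are optimized together. The head outputs, CTC-based recognition loss, and unit loss weights are assumptions, not TextFormer's actual design.

```python
# Hedged sketch of a joint multi-task loss for a text spotter; heads,
# loss choices, and weights are illustrative assumptions.
import torch.nn as nn

class MultiTaskSpotterLoss(nn.Module):
    def __init__(self, w_cls=1.0, w_seg=1.0, w_rec=1.0):
        super().__init__()
        self.w_cls, self.w_seg, self.w_rec = w_cls, w_seg, w_rec
        self.cls_loss = nn.CrossEntropyLoss()           # text / not-text classification
        self.seg_loss = nn.BCEWithLogitsLoss()          # text-region masks
        self.rec_loss = nn.CTCLoss(zero_infinity=True)  # transcription

    def forward(self, outs, targets):
        # Summing the branch losses lets gradients from all three tasks
        # flow back into the shared features ("deeper feature sharing").
        return (self.w_cls * self.cls_loss(outs["cls"], targets["cls"])
                + self.w_seg * self.seg_loss(outs["seg"], targets["seg"])
                + self.w_rec * self.rec_loss(outs["rec_logp"], targets["text"],
                                             targets["rec_in_len"],
                                             targets["rec_tgt_len"]))
```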
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations, the proposed network architecture is trained following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence (a frame-selection sketch follows this entry).
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
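A hedged sketch of the frame-selection idea: score each frame embedding against the text embedding and keep only the most semantically aligned frames before applying the contrastive objective. Cosine scoring and a fixed `top_k` are illustrative assumptions about FineCo's actual selection rule.

```python
# Illustrative fine-grained frame sampling: keep the frames most
# semantically aligned with the text. NOT FineCo's exact rule.
import torch
import torch.nn.functional as F

def select_relevant_frames(frame_embs: torch.Tensor,
                           text_emb: torch.Tensor,
                           top_k: int = 4) -> torch.Tensor:
    """frame_embs: (T, D) per-frame features; text_emb: (D,).

    Returns the top_k frames with the highest cosine similarity to the
    text, i.e. the frames treated as "semantically equivalent" to it.
    """
    sims = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1)  # (T,)
    idx = sims.topk(min(top_k, frame_embs.size(0))).indices
    return frame_embs[idx]
```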
- Real-time End-to-End Video Text Spotter with Contrastive Representation Learning [91.15406440999939]
We propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText).
CoText simultaneously addresses the three tasks (text detection, tracking, and recognition) in a real-time, end-to-end trainable framework.
A simple, lightweight architecture is designed for effective and accurate performance.
arXiv Detail & Related papers (2022-07-18T07:54:17Z)
- End-to-End Video Text Spotting with Transformer [86.46724646835627]
We propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR).
TransDETR is the first end-to-end trainable video text spotting framework, simultaneously addressing the three sub-tasks (text detection, tracking, and recognition).
arXiv Detail & Related papers (2022-03-20T12:14:58Z)
- Video Text Tracking With a Spatio-Temporal Complementary Model [46.99051486905713]
Text tracking aims to track multiple texts in a video and construct a trajectory for each text.
Existing methods tackle this task by utilizing the tracking-by-detection framework (sketched below after this entry).
We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios.
arXiv Detail & Related papers (2021-11-09T08:23:06Z)
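For context, the sketch below shows the tracking-by-detection paradigm that the entry above argues is limited: per-frame detections are linked across frames by greedy IoU matching. The threshold and greedy strategy are generic illustrative choices, not this paper's spatio-temporal model.

```python
# Minimal tracking-by-detection association: greedy IoU matching between
# consecutive frames. Thresholds and strategy are illustrative assumptions.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(prev: List[Box], curr: List[Box], thr: float = 0.5):
    """Greedily match current detections to previous tracks by IoU."""
    matches, used = [], set()
    for j, det in enumerate(curr):
        best_i, best_iou = -1, thr
        for i, trk in enumerate(prev):
            if i not in used and iou(trk, det) > best_iou:
                best_i, best_iou = i, iou(trk, det)
        if best_i >= 0:
            matches.append((best_i, j))
            used.add(best_i)
    return matches  # unmatched detections would start new trajectories
```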