GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
- URL: http://arxiv.org/abs/2505.22228v1
- Date: Wed, 28 May 2025 11:02:45 GMT
- Title: GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking
- Authors: Haibin He, Jing Zhang, Maoyuan Ye, Juhua Liu, Bo Du, Dacheng Tao
- Abstract summary: Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. GoMatching++ transforms an off-the-shelf image text spotter into a video specialist.
- Score: 77.0306273129475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
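To make the abstract's core idea concrete, here is a minimal PyTorch-style sketch of the frozen-spotter-plus-trainable-tracker setup: the image text spotter runs without gradients, while a small rescoring head and matcher are the only trained parts. All names below (RescoringHead, LSTMatcher, track_video) and their interfaces are hypothetical stand-ins for illustration, not the authors' actual API; see the official repository linked above for the real implementation.

```python
# Illustrative sketch only; names and shapes are assumptions, not the
# GoMatching++ codebase (see https://github.com/Hxyz-123/GoMatching).
import torch
import torch.nn as nn


class RescoringHead(nn.Module):
    """Stand-in for the rescoring mechanism: recalibrates the frozen
    spotter's confidences to narrow the image-to-video domain gap."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_instances, dim) -> (num_instances,) scores in [0, 1]
        return torch.sigmoid(self.mlp(feats)).squeeze(-1)


class LSTMatcher(nn.Module):
    """Stand-in for LST-Matcher: associates text instances across frames
    by similarity of their (frozen) query features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, prev: torch.Tensor, cur: torch.Tensor) -> torch.Tensor:
        # (num_prev, num_cur) association logits between two frames.
        return self.proj(prev) @ self.proj(cur).T


def track_video(spotter: nn.Module, frames, matcher: LSTMatcher, rescorer: RescoringHead):
    """Run the frozen spotter per frame; only matcher and rescorer train."""
    prev_feats, associations = None, []
    for frame in frames:
        with torch.no_grad():          # the image text spotter stays frozen
            feats = spotter(frame)     # (num_instances, dim) query features
        scores = rescorer(feats)       # video-adapted confidences
        if prev_feats is not None:
            associations.append((matcher(prev_feats, feats).softmax(-1), scores))
        prev_feats = feats
    return associations
```

Because gradients flow only through the matcher and rescoring head, this setup needs far fewer trainable parameters and far less video data than end-to-end retraining, which is the efficiency argument the abstract makes.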
Related papers
- Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval [56.05621657583251]
Cross-modal (e.g. image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding. We introduce RTime, a novel temporal-emphasized video-text retrieval dataset. RTime currently consists of 21k videos with 10 captions per video, totalling about 122 hours.
arXiv Detail & Related papers (2024-12-26T11:32:00Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA). We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. To let these four frozen models communicate, we propose using learnable tokens as a medium among GPT-2, XCLIP, CLIP, and AnglE (sketched after this entry).
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
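As a rough illustration of RETTA's learnable-token idea, the sketch below optimizes only a small set of tokens at test time while every pretrained model stays frozen. The two scoring functions are placeholders for the frozen models named in the entry, and the loss is an illustrative assumption, not the paper's actual formulation.

```python
# Hedged sketch: learnable tokens as the sole trainable "communication
# medium" among frozen models. The scorers below are placeholders, not
# the actual GPT-2 / XCLIP / CLIP / AnglE models.
import torch
import torch.nn as nn

tokens = nn.Parameter(torch.randn(8, 512) * 0.02)   # learnable tokens
optimizer = torch.optim.AdamW([tokens], lr=1e-3)    # nothing else is trained


def frozen_score_a(t: torch.Tensor) -> torch.Tensor:
    # stand-in for, e.g., a frozen CLIP image-text matching score
    return -(t ** 2).mean()


def frozen_score_b(t: torch.Tensor) -> torch.Tensor:
    # stand-in for, e.g., a frozen XCLIP video-text retrieval score
    return -(t - 1).abs().mean()


for step in range(100):                 # test-time adaptation loop
    loss = -(frozen_score_a(tokens) + frozen_score_b(tokens))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```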
- GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching [77.0306273129475]
Video text spotting presents an added challenge with the inclusion of tracking.
GoMatching focuses the training efforts on tracking while maintaining strong recognition performance.
GoMatching delivers new records on ICDAR15-video, DSText, BOVText, and our proposed novel test set with arbitrary-shaped text, termed ArTVideo.
arXiv Detail & Related papers (2024-01-13T13:59:15Z)
- RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval [61.77760317554826]
We propose Redundancy-aware Video-language Pre-training.
We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dissimilarity (sketched after this entry).
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
arXiv Detail & Related papers (2022-10-13T10:11:41Z)
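One plausible reading of the RaP entry's "cross-modal minimum dissimilarity" is sketched below: for each video patch, take its smallest cosine dissimilarity to any text token, so a patch that is far from every token looks redundant. The function name and exact formula are assumptions, not the paper's definition.

```python
# Hedged sketch of a cross-modal minimum-dissimilarity redundancy measure.
import torch
import torch.nn.functional as F


def min_cross_modal_dissimilarity(video_patches: torch.Tensor,
                                  text_tokens: torch.Tensor) -> torch.Tensor:
    """For each video patch, its minimum cosine dissimilarity to any text
    token; a high value suggests the patch matches no token (redundant).
    Shapes: video_patches (P, d), text_tokens (T, d) -> (P,)."""
    v = F.normalize(video_patches, dim=-1)
    t = F.normalize(text_tokens, dim=-1)
    dissim = 1.0 - v @ t.T             # (P, T) pairwise cosine dissimilarity
    return dissim.min(dim=-1).values   # (P,) min over all text tokens


# Example: score patches so the most redundant ones can be down-weighted.
patches, tokens = torch.randn(196, 256), torch.randn(20, 256)
redundancy = min_cross_modal_dissimilarity(patches, tokens)
```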
- Real-time End-to-End Video Text Spotter with Contrastive Representation Learning [91.15406440999939]
We propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText).
CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time end-to-end trainable framework.
A simple, lightweight architecture is designed for effective and accurate performance.
arXiv Detail & Related papers (2022-07-18T07:54:17Z)
- Contrastive Learning of Semantic and Visual Representations for Text Tracking [22.817884815010856]
In this paper, we explore robustly tracking video text with contrastive learning of semantic and visual representations (a generic sketch follows this entry).
We present an end-to-end video text tracker with Semantic and Visual Representations (SVRep), which detects and tracks texts by exploiting the visual and semantic relationships between different texts in a video sequence.
With a ResNet-18 backbone, SVRep achieves an $\mathrm{ID}_{F1}$ of $\textbf{65.9\%}$, running at $\textbf{16.7}$ FPS, on the ICDAR2015 (video) dataset.
arXiv Detail & Related papers (2021-12-30T09:22:13Z)
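The SVRep entry relies on contrastive learning of text-instance representations; below is a generic InfoNCE-style objective of the kind such trackers commonly use, not SVRep's exact loss. The shapes and temperature value are assumptions for illustration.

```python
# Generic InfoNCE-style contrastive loss for text-instance tracking:
# the same instance in adjacent frames is pulled together, others pushed apart.
import torch
import torch.nn.functional as F


def instance_contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                              negatives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """anchor, positive: (d,) embeddings of one instance in two frames;
    negatives: (N, d) embeddings of other instances."""
    a = F.normalize(anchor, dim=-1)
    pos = torch.dot(a, F.normalize(positive, dim=-1)) / tau   # scalar logit
    negs = F.normalize(negatives, dim=-1) @ a / tau           # (N,) logits
    logits = torch.cat([pos.unsqueeze(0), negs]).unsqueeze(0) # (1, N+1)
    target = torch.zeros(1, dtype=torch.long)                 # positive at index 0
    return F.cross_entropy(logits, target)
```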