VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
- URL: http://arxiv.org/abs/2404.19652v3
- Date: Tue, 14 May 2024 15:07:05 GMT
- Title: VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
- Authors: Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai
- Abstract summary: VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
- Score: 115.64739269488965
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and less data. The code and datasets will be made available at https://VimTextSpotter.github.io.
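The abstract describes the Prompt Queries Generation Module and Tasks-aware Adapter only at a high level. As a rough illustration of the general idea (a frozen single-task backbone extended with per-task prompt queries and lightweight residual adapters), the sketch below shows one way such components could be wired; all class names, dimensions, and the wiring are assumptions, not the actual VimTS design.
```python
# Hypothetical sketch of a task-aware adapter with per-task prompt queries.
# Names, dimensions, and wiring are illustrative assumptions, not VimTS's design.
import torch
import torch.nn as nn


class TaskAwareAdapter(nn.Module):
    """Small bottleneck adapter; one instance per task keeps added parameters low."""

    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: frozen backbone features are only lightly adjusted.
        return x + self.up(self.act(self.down(x)))


class PromptQueryGenerator(nn.Module):
    """Produces task-specific queries to be fed to a shared decoder."""

    def __init__(self, num_tasks: int = 3, num_queries: int = 100, dim: int = 256):
        super().__init__()
        self.task_queries = nn.Parameter(torch.randn(num_tasks, num_queries, dim) * 0.02)

    def forward(self, task_id: int, batch_size: int) -> torch.Tensor:
        return self.task_queries[task_id].unsqueeze(0).expand(batch_size, -1, -1)


class MultiTaskSpotterSketch(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int = 256, num_tasks: int = 3):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # keep the original spotter frozen
            p.requires_grad_(False)
        self.adapters = nn.ModuleList([TaskAwareAdapter(dim) for _ in range(num_tasks)])
        self.prompts = PromptQueryGenerator(num_tasks=num_tasks, dim=dim)

    def forward(self, images: torch.Tensor, task_id: int):
        feats = self.backbone(images)                      # assumed (B, N, dim) token features
        feats = self.adapters[task_id](feats)              # task-specific adjustment
        queries = self.prompts(task_id, images.shape[0])   # (B, Q, dim) prompt queries
        return feats, queries                              # consumed by a shared decoder head
```
In a setup like this only the adapters and prompt queries would be trained, which is what keeps the added parameter count per task small.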
Related papers
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
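The summary's key point is that VidVAE shrinks the number of visual tokens the DiT has to attend over. A quick back-of-the-envelope calculation (the 4x temporal and 8x spatial strides below are assumed values, not the model's actual configuration) shows why this matters:
```python
# Illustrative token-count arithmetic; the compression factors are assumptions,
# not the actual VidVAE configuration.
def visual_token_count(frames: int, height: int, width: int,
                       t_stride: int = 4, s_stride: int = 8) -> int:
    """Tokens after temporal stride t_stride and spatial stride s_stride per axis."""
    return (frames // t_stride) * (height // s_stride) * (width // s_stride)


raw = visual_token_count(128, 256, 256, t_stride=1, s_stride=1)  # no compression
compressed = visual_token_count(128, 256, 256)                   # assumed 4x / 8x strides
print(raw, compressed, raw // compressed)  # 8388608 32768 256 -> ~256x fewer tokens
```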
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Foundation Model is Efficient Multimodal Multitask Model Selector [47.017463595702274]
A brute-force approach is to fine-tune all models on all target datasets, which incurs high computational costs.
We propose an efficient multi-task model selector (EMMS) to transform diverse label formats into a unified noisy label embedding.
EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario.
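The snippet does not spell out how the unified label embedding is used; one common way to turn such embeddings into a transferability score is to check how well a linear map predicts them from a candidate model's frozen features. The sketch below follows that generic recipe and should be read as an illustration, not the exact EMMS procedure:
```python
# Hedged sketch: rank pre-trained models by how well their frozen features
# linearly predict unified label embeddings. Using regression error as the score
# and a generic text encoder for labels are assumptions, not the EMMS algorithm.
import numpy as np
from numpy.linalg import lstsq


def transferability_score(features: np.ndarray, label_emb: np.ndarray) -> float:
    """features: (n_samples, d_model) frozen features from one candidate model.
    label_emb: (n_samples, d_label) unified label embeddings for the target dataset.
    Higher is better."""
    w, *_ = lstsq(features, label_emb, rcond=None)   # least-squares linear map
    residual = label_emb - features @ w
    return -float(np.mean(residual ** 2))


# Score every candidate model, then fine-tune only the top-ranked one,
# instead of brute-force fine-tuning every model on every dataset.
```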
arXiv Detail & Related papers (2023-08-11T17:54:44Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Semi-Parametric Video-Grounded Text Generation [21.506377836451577]
In this paper, we propose a semi-parametric video-grounded text generation model, SeViT.
Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames.
Experimental results demonstrate our method has a significant advantage in longer videos and causal video understanding.
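The retrieval step itself can be illustrated with a minimal sketch: given precomputed frame features (the feature encoders and the cosine-similarity criterion here are assumptions), selecting query-relevant frames is a top-k lookup over the video treated as a data store.
```python
# Minimal sketch of non-parametric, query-relevant frame retrieval;
# the frame/query encoders are placeholders, not SeViT's actual modules.
import torch
import torch.nn.functional as F


def retrieve_frames(frame_feats: torch.Tensor, query_feat: torch.Tensor, k: int = 4):
    """frame_feats: (num_frames, dim) precomputed features acting as an external data store.
    query_feat: (dim,) embedding of the text query.
    Returns indices of the k most query-relevant frames."""
    sims = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)
    return torch.topk(sims, k=min(k, frame_feats.shape[0])).indices


# Usage with random stand-in features for a 300-frame video.
frames = torch.randn(300, 512)
query = torch.randn(512)
print(retrieve_frames(frames, query, k=4))
```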
arXiv Detail & Related papers (2023-01-27T03:00:43Z)
- MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval [60.454321238910474]
State-of-the-art video-text retrieval methods typically involve fully fine-tuning a pre-trained model on specific datasets.
We present a pioneering approach that enables parameter-efficient video-text retrieval (VTR) using a pre-trained model.
We propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text.
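The snippet gives no architectural details; a generic bottleneck adapter with a lightweight temporal mixing layer on top of frozen per-frame CLIP features (an assumed design, not the actual MV-Adapter) conveys the parameter-efficient transfer idea:
```python
# Generic video adapter sketch on top of frozen per-frame image features.
# The depthwise temporal convolution, bottleneck sizes, and mean pooling over
# time are assumptions for illustration, not the MV-Adapter architecture.
import torch
import torch.nn as nn


class VideoAdapterSketch(nn.Module):
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) features from a frozen image encoder, one per frame.
        t = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2)  # mix across time
        x = frame_feats + self.up(torch.relu(self.down(t)))             # residual adapter
        return x.mean(dim=1)                                            # (B, dim) video embedding
```
Only the adapter parameters would be trained here; the pre-trained image-text encoders stay frozen, which is the sense in which such transfer is parameter-efficient.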
arXiv Detail & Related papers (2023-01-19T03:42:56Z)
- PolyViT: Co-training Vision Transformers on Images, Videos and Audio [80.0913507142036]
We present PolyViT, a model trained on image, audio and video.
By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task.
We show that co-training is simple and practical to implement.
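Co-training in this setting largely comes down to a sampling schedule over task-specific batches that all pass through one shared backbone. The loop below is a generic sketch; the task names, sampling weights, and per-task heads are placeholders rather than PolyViT's actual recipe:
```python
# Generic co-training loop over multiple tasks sharing one backbone.
# Task list, sampling weights, and classification heads are illustrative assumptions.
import random
import torch


def cotrain(backbone, heads, loaders, optimizer, steps=10_000, weights=None):
    """heads/loaders: dicts keyed by task, e.g. {"image_cls": ..., "video_cls": ..., "audio_cls": ...}."""
    tasks = list(loaders.keys())
    iters = {t: iter(loaders[t]) for t in tasks}
    for _ in range(steps):
        task = random.choices(tasks, weights=weights)[0]   # weighted task sampling per step
        try:
            x, y = next(iters[task])
        except StopIteration:                              # restart an exhausted loader
            iters[task] = iter(loaders[task])
            x, y = next(iters[task])
        loss = torch.nn.functional.cross_entropy(heads[task](backbone(x)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```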
arXiv Detail & Related papers (2021-11-25T10:01:05Z)
- Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
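A unified encoder-RNN-decoder layout can be sketched generically: images are treated as single-frame sequences, so the same network handles both domains. Module sizes and layer choices below are assumptions, not UNISAL's actual configuration.
```python
# Generic encoder-RNN-decoder sketch for joint image/video saliency prediction;
# sizes and layers are assumptions, not the UNISAL architecture.
import torch
import torch.nn as nn


class SaliencySketch(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.rnn = nn.GRU(input_size=ch, hidden_size=ch, batch_first=True)
        self.decoder = nn.Sequential(nn.Upsample(scale_factor=4, mode="bilinear"),
                                     nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W); a still image is simply the T == 1 case.
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))                   # (B*T, C, h, w)
        c, h, w = feats.shape[1:]
        seq = feats.view(b, t, c, h, w).permute(0, 3, 4, 1, 2)     # (B, h, w, T, C)
        out, _ = self.rnn(seq.reshape(b * h * w, t, c))            # recurrence over time
        out = out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)    # (B, T, C, h, w)
        return self.decoder(out.flatten(0, 1)).view(b, t, 1, 4 * h, 4 * w)
```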
arXiv Detail & Related papers (2020-03-11T18:28:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.