MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
- URL: http://arxiv.org/abs/2506.10609v1
- Date: Thu, 12 Jun 2025 11:54:13 GMT
- Title: MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
- Authors: Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu
- Abstract summary: Multi-query Scene Text retrieval with Attention Recycling (MSTAR) is a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset.
- Score: 58.251621637466904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. In addition, they mostly adopt customized retrieval strategies but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
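To make the box-free retrieval idea concrete, below is a minimal illustrative sketch, not the authors' implementation: images are represented by a set of instance-level text embeddings produced without box supervision, queries are wrapped in a style-aware instruction, and each image is scored by its best-matching instance as a stand-in for the multi-instance matching module. All function names and the instruction template are assumptions.

```python
import numpy as np

def rank_images_by_query(query_emb: np.ndarray,
                         image_instance_embs: list) -> list:
    """Rank images for one query by their best-matching text instance.

    query_emb:           (d,) L2-normalized query embedding.
    image_instance_embs: one (n_i, d) array per image, holding L2-normalized
                         embeddings of its text instances (assumed to come
                         from a box-free vision encoder).
    Returns image indices sorted from most to least relevant.
    """
    scores = []
    for inst in image_instance_embs:
        # Multi-instance matching stand-in: an image is as relevant as
        # its single best-matching text instance.
        scores.append(float(np.max(inst @ query_emb)) if len(inst) else -1.0)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def make_style_aware_query(query: str, style: str = "word") -> str:
    """Prefix a free-style query with a style-aware instruction so that
    different query types (word, phrase, pattern) share one text encoder.
    The template is a hypothetical example, not the paper's prompt."""
    return f"retrieve scene text, query style: {style}. query: {query}"
```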
Related papers
- jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval [5.587329786636647]
We introduce jina-embeddings-v4, a multimodal embedding model that unifies text and image representations. The model incorporates task-specific Low-Rank Adaptation (LoRA) adapters to optimize performance across diverse retrieval scenarios. To facilitate evaluation of this capability, we also introduce Jina-VDR, a novel benchmark specifically designed for visually rich image retrieval.
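For readers unfamiliar with task-specific LoRA adapters, the following is a minimal, hedged PyTorch sketch of the general technique; the rank, scaling, task names, and switching mechanism are illustrative assumptions, not details of jina-embeddings-v4.

```python
import torch
import torch.nn as nn

class MultiTaskLoRALinear(nn.Module):
    """A frozen base linear layer plus one low-rank adapter per task:
    y = W x + (alpha / r) * B_task(A_task(x)).
    Rank r, alpha, and the task list are illustrative assumptions."""
    def __init__(self, in_dim, out_dim, tasks=("text", "image", "code"),
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # shared backbone stays frozen
        self.base.bias.requires_grad_(False)
        self.scale = alpha / r
        self.lora_a = nn.ModuleDict({t: nn.Linear(in_dim, r, bias=False) for t in tasks})
        self.lora_b = nn.ModuleDict({t: nn.Linear(r, out_dim, bias=False) for t in tasks})
        for t in tasks:
            nn.init.zeros_(self.lora_b[t].weight)  # adapters start as a no-op

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b[task](self.lora_a[task](x))

# Usage: select the adapter that matches the retrieval scenario at inference.
layer = MultiTaskLoRALinear(768, 768)
embedding = layer(torch.randn(2, 768), task="image")
```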
arXiv Detail & Related papers (2025-06-23T17:59:55Z) - MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query [55.486895951981566]
This paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval.
arXiv Detail & Related papers (2025-06-03T17:59:14Z) - MIRACL-VISION: A Large, multilingual, visual document retrieval benchmark [1.8448587047759064]
We introduce MIRACL-VISION, a multilingual visual document retrieval evaluation benchmark. MIRACL-VISION covers 18 languages, and is an extension of the MIRACL dataset. We observe a gap in state-of-the-art VLM-based embedding models on multilingual capabilities.
arXiv Detail & Related papers (2025-05-16T19:22:19Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
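As a rough illustration of compressing visual tokens at different granularities, here is a small sketch; the pooling scheme and granularity levels are assumptions, not the MME architecture itself.

```python
import torch

def compress_visual_tokens(tokens: torch.Tensor, num_keep: int) -> torch.Tensor:
    """Pool a (batch, n_tokens, dim) visual token sequence down to num_keep
    tokens by averaging contiguous groups; smaller num_keep gives a coarser,
    cheaper representation, larger values preserve more detail."""
    b, n, d = tokens.shape
    assert n % num_keep == 0, "for simplicity, require divisible group sizes"
    return tokens.view(b, num_keep, n // num_keep, d).mean(dim=2)

# Example: 256 visual tokens from one image, kept at three granularities.
visual = torch.randn(1, 256, 1024)
coarse, medium, fine = (compress_visual_tokens(visual, k) for k in (4, 16, 64))
```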
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
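Purely to illustrate the shared-encoder, multi-branch idea described above (not TextFormer's actual architecture), a joint model can feed one feature map into classification, segmentation, and recognition heads so that gradients from all three tasks update the shared encoder; every dimension and head design below is an assumption.

```python
import torch
import torch.nn as nn

class MultiBranchSpotter(nn.Module):
    """Shared image encoder feeding three task heads; all dimensions and
    head designs here are illustrative assumptions."""
    def __init__(self, dim: int = 256, num_classes: int = 2, vocab: int = 97):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU())
        self.cls_head = nn.Conv2d(dim, num_classes, 1)  # text / background logits
        self.seg_head = nn.Conv2d(dim, 1, 1)            # per-pixel text mask
        self.rec_head = nn.Linear(dim, vocab)           # character logits per location

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)               # (b, dim, h, w)
        rec_in = feats.flatten(2).transpose(1, 2)  # (b, h*w, dim) for recognition
        return self.cls_head(feats), self.seg_head(feats), self.rec_head(rec_in)

# A summed loss over the three outputs would realize the mutual training of
# the classification, segmentation, and recognition branches.
```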
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z) - StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z)