EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition
- URL: http://arxiv.org/abs/2502.09020v1
- Date: Thu, 13 Feb 2025 07:16:16 GMT
- Title: EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition
- Authors: Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang
- Abstract summary: Scene Text Recognition algorithms are developed for RGB cameras, which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds.
We propose to recognize the scene text using bio-inspired event cameras by collecting and annotating a large-scale benchmark dataset, termed EventSTR.
We also propose a new event-based scene text recognition framework, termed SimC-ESTR.
- Score: 39.12227212510573
- Abstract: Mainstream Scene Text Recognition (STR) algorithms are developed for RGB cameras, which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds. In this paper, we propose to recognize scene text using bio-inspired event cameras by collecting and annotating a large-scale benchmark dataset, termed EventSTR. It contains 9,928 high-definition (1280 × 720) event samples and involves both Chinese and English characters. We also benchmark multiple STR algorithms as baselines for future works to compare against. In addition, we propose a new event-based scene text recognition framework, termed SimC-ESTR. It first extracts event features using a visual encoder and projects them into tokens using a Q-Former module. More importantly, we propose to augment the vision tokens based on a memory mechanism before feeding them into the large language model. A similarity-based error correction mechanism embedded within the large language model corrects potential minor errors based on contextual information. Extensive experiments on the newly proposed EventSTR dataset and two simulated STR datasets fully demonstrate the effectiveness of our proposed model. We believe the dataset and algorithmic model introduce a novel event-based STR task and are expected to accelerate the application of event cameras in various industries. The source code and pre-trained models will be released at https://github.com/Event-AHU/EventSTR
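The abstract outlines a pipeline of visual encoder, Q-Former projection, memory-based token augmentation, and an LLM with similarity-based correction. Below is a minimal PyTorch sketch of the middle stages; every module, dimension, and the soft memory lookup is an illustrative assumption, not the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the pipeline described in the abstract: visual encoder
# features -> Q-Former-style projection -> memory-augmented tokens (-> LLM).
# All module choices and dimensions are illustrative assumptions.

class QFormerProjector(nn.Module):
    """Compress encoder features into a fixed set of query tokens via cross-attention."""
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                         # feats: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, feats, feats)  # (B, num_queries, dim)
        return tokens

class MemoryAugmenter(nn.Module):
    """Augment vision tokens with a soft lookup over a learned memory bank."""
    def __init__(self, dim=256, memory_size=512):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_size, dim))

    def forward(self, tokens):                        # tokens: (B, Q, dim)
        sim = tokens @ self.memory.t()                # (B, Q, M) similarities
        retrieved = sim.softmax(dim=-1) @ self.memory # weighted memory readout
        return tokens + retrieved                     # residual fusion

# Toy forward pass with random features standing in for a real event encoder.
feats = torch.randn(2, 196, 256)                      # (batch, patches, dim)
augmented = MemoryAugmenter()(QFormerProjector()(feats))
print(augmented.shape)                                # torch.Size([2, 32, 256])
```

The residual readout in MemoryAugmenter is just one plausible reading of "augmenting the vision tokens based on a memory mechanism"; the repository above should be treated as authoritative.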
Related papers
- EventGPT: Event Stream Understanding with Multimodal Large Language Models [59.65010502000344]
Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions.
Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better.
We introduce EventGPT, the first MLLM for event stream understanding.
arXiv Detail & Related papers (2024-12-01T14:38:40Z)
- Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input [8.365349007799296]
Event cameras are advantageous for tasks that require vision sensors with low-latency and sparse output responses.
This paper reports a method for creating new labelled event datasets by using a text-to-X model.
We demonstrate that the model can generate realistic event sequences of human gestures prompted by different text statements.
arXiv Detail & Related papers (2024-06-05T16:34:12Z)
- Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models [15.231177830711077]
We introduce a novel pattern recognition framework that consolidates semantic labels, RGB frames, and event streams.
To handle the semantic labels, we convert them into language descriptions through prompt engineering.
We integrate the RGB/Event features and semantic features using multimodal Transformer networks.
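A minimal sketch of one way such multimodal fusion can look, assuming token concatenation with modality-type embeddings; the paper's actual fusion design may differ:

```python
import torch
import torch.nn as nn

# One common recipe for multimodal Transformer fusion: concatenate per-modality
# tokens, tag each with a modality-type embedding, and run a shared encoder.
# The paper's exact fusion design may differ; dimensions here are arbitrary.

dim = 256
fuser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
type_embed = nn.Embedding(3, dim)   # 0: RGB, 1: event, 2: text

rgb   = torch.randn(2, 49, dim)     # (batch, tokens, dim) per modality
event = torch.randn(2, 49, dim)
text  = torch.randn(2, 16, dim)     # e.g. encoded prompt-engineered labels

tokens = torch.cat([
    rgb   + type_embed(torch.tensor(0)),
    event + type_embed(torch.tensor(1)),
    text  + type_embed(torch.tensor(2)),
], dim=1)
fused = fuser(tokens)               # (2, 114, dim) jointly attended features
print(fused.shape)
```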
arXiv Detail & Related papers (2023-11-30T14:35:51Z)
- GET: Group Event Transformer for Event-Based Vision [82.312736707534]
Event cameras are a novel type of neuromorphic sensor that has been gaining increasing attention.
We propose a novel Group-based vision Transformer backbone for Event-based vision, called Group Event Transformer (GET).
GET decouples temporal-polarity information from spatial information throughout the feature extraction process.
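One standard way to realize this decoupling is to bin raw events into a tensor whose channel axis indexes time and polarity while the remaining axes stay spatial. The sketch below shows that generic voxel-style representation, not GET's actual Group Token mechanism:

```python
import torch

# Binning raw events (x, y, t, p) into a (time_bins * polarities, H, W) tensor
# keeps temporal-polarity information in the channel axis, separate from the
# two spatial axes. This is the generic voxel-style representation, not GET's
# actual Group Token mechanism.

def events_to_grid(x, y, t, p, H=64, W=64, time_bins=4):
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)
    t_idx = (t_norm * (time_bins - 1)).long()                 # temporal bin index
    grid = torch.zeros(time_bins, 2, H, W)                    # (T, polarity, H, W)
    ones = torch.ones(x.numel())
    grid.index_put_((t_idx, p, y, x), ones, accumulate=True)  # count events per cell
    return grid.view(time_bins * 2, H, W)                     # time x polarity channels

# Toy event stream of 1,000 random events.
n = 1000
x = torch.randint(0, 64, (n,)); y = torch.randint(0, 64, (n,))
t = torch.rand(n).sort().values;  p = torch.randint(0, 2, (n,))
print(events_to_grid(x, y, t, p).shape)                       # torch.Size([8, 64, 64])
```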
arXiv Detail & Related papers (2023-10-04T08:02:33Z)
- Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline [38.42400442371156]
Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker.
We propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer.
We propose the first large-scale high-resolution ($1280 \times 720$) dataset named EventVOT. It contains 1,141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, ping-pong balls, etc.
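The hierarchical knowledge distillation mentioned above is not detailed in this summary; a minimal sketch of the underlying teacher-student pattern, with assumed temperature and weighting, looks like this:

```python
import torch
import torch.nn.functional as F

# Generic teacher-student distillation losses: KL divergence on temperature-
# softened logits (Hinton-style) plus MSE feature imitation. The paper's
# hierarchical, multi-modal scheme adds more levels; T and alpha are assumed.

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, T=4.0, alpha=0.5):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale soft-target gradients
    feat = F.mse_loss(student_feat, teacher_feat)  # feature-level imitation
    return alpha * kd + (1 - alpha) * feat

# Toy logits/features for an 8-sample batch over 10 classes.
s_logits, t_logits = torch.randn(8, 10), torch.randn(8, 10)
s_feat,   t_feat   = torch.randn(8, 256), torch.randn(8, 256)
print(distillation_loss(s_logits, t_logits, s_feat, t_feat))
```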
arXiv Detail & Related papers (2023-09-26T01:42:26Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities for action recognition compared to standard RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
To better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
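The exact form of $\mathcal{L}_{EC}$ is not given in this summary; an InfoNCE-style objective over paired embeddings is a representative stand-in:

```python
import torch
import torch.nn.functional as F

# InfoNCE-style contrastive loss over paired embeddings (e.g. two augmented
# views of the same event clip). The summary does not define L_EC exactly, so
# this is a representative contrastive objective, not the paper's formula.

def contrastive_loss(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature       # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(contrastive_loss(z1, z2))
```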
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding [7.797154022794006]
EventBind is a novel framework that unleashes the potential of vision-language models (VLMs) for event-based recognition.
We first introduce a novel event encoder that subtly models the temporal information from events.
We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability.
arXiv Detail & Related papers (2023-08-06T15:05:42Z)
- Geometric Perception based Efficient Text Recognition [0.0]
In real-world applications with fixed camera positions, the underlying data tends to be regular scene text.
This paper presents the underlying concepts, theory, implementation, and experimental results for developing such specialized models.
We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image using only the geometric features present.
arXiv Detail & Related papers (2023-02-08T04:19:24Z)
- CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework that enforces event-structure understanding in vision-language pretraining models.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)