SAFL: A Self-Attention Scene Text Recognizer with Focal Loss
- URL: http://arxiv.org/abs/2201.00132v1
- Date: Sat, 1 Jan 2022 06:51:03 GMT
- Title: SAFL: A Self-Attention Scene Text Recognizer with Focal Loss
- Authors: Bao Hieu Tran, Thanh Le-Cong, Huu Manh Nguyen, Duc Anh Le, Thanh Hung
Nguyen, Phi Le Nguyen
- Abstract summary: Scene text recognition remains challenging due to inherent problems such as distortions or irregular layout.
Most of the existing approaches mainly leverage recurrence or convolution-based neural networks.
We introduce SAFL, a self-attention-based neural network model with the focal loss for scene text recognition.
- Score: 4.462730814123762
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the last decades, scene text recognition has gained worldwide attention
from both the academic community and actual users due to its importance in a
wide range of applications. Despite achievements in optical character
recognition, scene text recognition remains challenging due to inherent
problems such as distortions or irregular layout. Most of the existing
approaches mainly leverage recurrence or convolution-based neural networks.
However, recurrent neural networks (RNNs) usually suffer from slow
training speed due to sequential computation and encounter problems such as
vanishing gradients or bottlenecks, while CNNs face a trade-off between complexity
and performance. In this paper, we introduce SAFL, a self-attention-based
neural network model with the focal loss for scene text recognition, to
overcome the limitations of existing approaches. Using focal loss
instead of negative log-likelihood helps the model focus more on
low-frequency samples during training. Moreover, to deal with distorted and
irregular text, we exploit a Spatial Transformer Network (STN) to rectify text
before passing it to the recognition network. We perform experiments to
evaluate the proposed model on seven benchmarks. The numerical results show
that our model achieves the best performance.
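The abstract's claim that focal loss shifts attention toward rarely seen samples can be illustrated with a minimal sketch. It assumes the standard focal loss form, FL(p) = -alpha * (1 - p)^gamma * log(p), where p is the probability the model assigns to the correct character. The function names and the default values gamma = 2 and alpha = 1 are illustrative assumptions; the abstract does not state the settings SAFL uses.

```python
import math


def negative_log_likelihood(p_true):
    """Standard NLL for the probability assigned to the correct class."""
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss: NLL scaled by (1 - p_true) ** gamma.

    gamma and alpha are assumed illustrative values, not SAFL's settings.
    With gamma = 0 and alpha = 1 this reduces to plain NLL.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def mean_focal_loss(char_probs, gamma=2.0, alpha=1.0):
    """Average focal loss over the characters of one predicted word."""
    losses = [focal_loss(p, gamma, alpha) for p in char_probs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # A confidently predicted (typically frequent) character versus a
    # poorly predicted (typically rare) one.
    for p in (0.9, 0.1):
        print(p, negative_log_likelihood(p), focal_loss(p))
```

With gamma = 2, the loss for a character predicted with probability 0.9 is scaled by (1 - 0.9)^2 = 0.01, while one predicted with probability 0.1 is scaled by 0.81. Well-learned characters therefore contribute much less to the total loss than they would under negative log-likelihood, leaving more of the gradient to the hard, infrequent ones.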
Related papers
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Surrogate Gradient Spiking Neural Networks as Encoders for Large
Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method.
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
- Model Blending for Text Classification [0.15229257192293197]
We try to reduce the complexity of state-of-the-art LSTM models for natural language tasks such as text classification by distilling their knowledge into CNN-based models, thus reducing the inference time (or latency) during testing.
arXiv Detail & Related papers (2022-08-05T05:07:45Z)
- Neural Maximum A Posteriori Estimation on Unpaired Data for Motion
Deblurring [87.97330195531029]
We propose a Neural Maximum A Posteriori (NeurMAP) estimation framework for training neural networks to recover blind motion information and sharp content from unpaired data.
The proposed NeurMAP is an approach to existing deblurring neural networks, and is the first framework that enables training image deblurring networks on unpaired datasets.
arXiv Detail & Related papers (2022-04-26T08:09:47Z)
- FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning [23.13972240042859]
We propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types.
FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network which facilitates the learning of strong spectral frame-level representations.
We present a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters.
arXiv Detail & Related papers (2020-09-23T21:51:29Z)
- The FaceChannel: A Fast & Furious Deep Neural Network for Facial
Expression Recognition [71.24825724518847]
Current state-of-the-art models for automatic Facial Expression Recognition (FER) are based on very deep neural networks that are effective but rather expensive to train.
We formalize the FaceChannel, a light-weight neural network that has far fewer parameters than common deep neural networks.
We demonstrate how our model achieves a comparable, if not better, performance to the current state-of-the-art in FER.
arXiv Detail & Related papers (2020-09-15T09:25:37Z)
- Surrogate gradients for analog neuromorphic computing [2.6475944316982942]
We show that learning self-corrects for device mismatch resulting in competitive spiking network performance on vision and speech benchmarks.
Our work sets several new benchmarks for low-energy spiking network processing on analog neuromorphic hardware.
arXiv Detail & Related papers (2020-06-12T14:45:12Z)
- "I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion
Recognition [0.0]
In applications that use emotion recognition via speech, frame-loss can be a severe issue given manifold applications.
We investigate for the first time the effects of frame-loss on the performance of emotion recognition via speech.
arXiv Detail & Related papers (2020-05-15T19:33:40Z)
- Suppressing Uncertainties for Large-Scale Facial Expression Recognition [81.51495681011404]
This paper proposes a simple yet efficient Self-Cure Network (SCN) which suppresses the uncertainties efficiently and prevents deep networks from over-fitting uncertain facial images.
Results on public benchmarks demonstrate that our SCN outperforms current state-of-the-art methods with 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus.
arXiv Detail & Related papers (2020-02-24T17:24:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.