Looking and Listening: Audio Guided Text Recognition
- URL: http://arxiv.org/abs/2306.03482v1
- Date: Tue, 6 Jun 2023 08:08:18 GMT
- Title: Looking and Listening: Audio Guided Text Recognition
- Authors: Wenwen Yu, Mingyu Liu, Biao Yang, Enming Zhang, Deqiang Jiang, Xing
Sun, Yuliang Liu, Xiang Bai
- Abstract summary: Text recognition in the wild is a long-standing problem in computer vision.
Recent studies suggest vision and language processing are effective for scene text recognition.
Yet, handling edit errors such as character additions, deletions, and substitutions remains the main challenge for existing approaches.
We propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction.
- Score: 62.98768236858089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text recognition in the wild is a long-standing problem in computer vision.
Driven by end-to-end deep learning, recent studies suggest vision and language
processing are effective for scene text recognition. Yet, handling edit errors
such as character additions, deletions, and substitutions remains the main
challenge for existing approaches. In fact, the content of a text and its audio
naturally correspond to each other, i.e., a single character error may result
in a clearly different pronunciation. In this paper, we propose AudioOCR, a simple
yet effective probabilistic audio decoder for mel spectrogram sequence
prediction to guide the scene text recognition, which only participates in the
training phase and brings no extra cost during the inference stage. The
underlying principle of AudioOCR can be easily applied to existing
approaches. Experiments using 7 previous scene text recognition methods on 12
existing regular, irregular, and occluded benchmarks demonstrate our proposed
method can bring consistent improvement. More importantly, through our
experimentation, we show that AudioOCR possesses a generalizability that
extends to more challenging scenarios, including recognizing non-English text,
out-of-vocabulary words, and text with various accents. Code will be available
at https://github.com/wenwenyu/AudioOCR.
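Below is a minimal PyTorch sketch of the training-only auxiliary audio decoder idea described in the abstract: a shared visual encoder feeds both a character-prediction head and an audio head that regresses mel-spectrogram frames, and the audio branch is simply dropped at inference. All module names, shapes, and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a training-only auxiliary audio decoder, in the spirit
# of AudioOCR. Modules, shapes, and losses are illustrative assumptions.
import torch
import torch.nn as nn

class TextRecognizerWithAudioHead(nn.Module):
    def __init__(self, vocab_size=97, d_model=256, n_mels=80):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in visual encoder
            nn.Conv2d(3, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 32)),       # collapse height, keep width
        )
        self.text_head = nn.Linear(d_model, vocab_size)              # main branch
        self.audio_head = nn.GRU(d_model, n_mels, batch_first=True)  # aux branch

    def forward(self, images, train_audio=True):
        f = self.backbone(images)                # (B, C, 1, W)
        f = f.squeeze(2).transpose(1, 2)         # (B, W, C) feature sequence
        logits = self.text_head(f)               # per-step character logits
        if self.training and train_audio:
            mel, _ = self.audio_head(f)          # predicted mel frames (B, W, n_mels)
            return logits, mel
        return logits                            # inference: audio head unused

def training_loss(logits, targets, mel_pred, mel_gt, lam=0.1):
    # Main recognition loss plus an auxiliary mel regression loss.
    # Per-step cross-entropy is a simplification of real CTC/attention decoding.
    ce = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
    audio = nn.functional.l1_loss(mel_pred, mel_gt)
    return ce + lam * audio
```

Because the audio head is only exercised under `self.training`, inference cost matches the plain recognizer, consistent with the abstract's claim of no extra cost at test time.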
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
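As a rough illustration of how such a pipeline can be composed, the sketch below takes diarised, transcribed speech segments and names each one by the character whose on-screen face track overlaps it most. The data structures and the overlap heuristic are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of an audio-visual subtitling pipeline: diarise, transcribe,
# then map speaker segments to named characters via visual face tracks.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    speaker: str   # diarisation cluster id, e.g. "spk0"
    text: str      # ASR transcript for the segment

def assign_characters(segments, face_tracks):
    """face_tracks: {character_name: [(start, end), ...]} from a visual
    character-recognition model. Assign each speech segment to the
    character whose on-screen time overlaps it the most."""
    def overlap(seg, span):
        return max(0.0, min(seg.end, span[1]) - max(seg.start, span[0]))

    subtitles = []
    for seg in segments:
        best, best_ov = "UNKNOWN", 0.0
        for name, spans in face_tracks.items():
            ov = sum(overlap(seg, s) for s in spans)
            if ov > best_ov:
                best, best_ov = name, ov
        subtitles.append((seg.start, seg.end, best, seg.text))
    return subtitles
```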
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- An efficient text augmentation approach for contextualized Mandarin speech recognition [4.600045052545344]
Our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models.
To contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data.
Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance.
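The summary leaves the codebook construction unspecified; the sketch below only illustrates the generic attend-and-bias pattern that such contextualization typically uses, with all names and sizes assumed rather than taken from the paper.

```python
# Loose sketch of contextual biasing with a phrase codebook. How the paper
# builds and injects its codebook into a CIF-based ASR is not shown here.
import torch
import torch.nn as nn

class CodebookBias(nn.Module):
    def __init__(self, d_model=256, n_phrases=100):
        super().__init__()
        # One learned embedding per context phrase (e.g. names, hotwords),
        # which could be initialised from text-only data.
        self.codebook = nn.Parameter(torch.randn(n_phrases, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, acoustic_states):               # (B, T, d_model)
        B = acoustic_states.size(0)
        cb = self.codebook.unsqueeze(0).expand(B, -1, -1)
        bias, _ = self.attn(acoustic_states, cb, cb)  # attend over phrases
        return acoustic_states + bias                 # biased decoder input
```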
arXiv Detail & Related papers (2024-06-14T11:53:14Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
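The replacement of one-hot targets with corpus-derived distributions can be illustrated with a soft-target cross-entropy; the mixing scheme below is an assumed, generic formulation rather than the paper's exact loss.

```python
# Sketch of the core idea: supervise the recognizer with a soft character
# distribution from a text corpus instead of a one-hot label.
import torch
import torch.nn.functional as F

def soft_target_loss(logits, one_hot_targets, prior, alpha=0.9):
    """logits: (N, V) per-character predictions.
    one_hot_targets: (N, V) hard labels.
    prior: (N, V) corpus-derived distribution over likely characters.
    Mixes hard labels with the linguistic prior and trains with
    cross-entropy against the resulting soft distribution."""
    soft = alpha * one_hot_targets + (1 - alpha) * prior
    log_p = F.log_softmax(logits, dim=-1)
    return -(soft * log_p).sum(dim=-1).mean()
```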
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment [16.304894187743013]
TEFAL is a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query.
Our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately.
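A minimal sketch of those two independent cross-attention blocks, with text queries attending to audio and video keys/values separately; the dimensions and pooling below are assumptions, not TEFAL's published configuration.

```python
# Text-conditioned feature alignment: the text attends to each modality
# through its own cross-attention block, as the summary describes.
import torch
import torch.nn as nn

class TextConditionedAlignment(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(d, 8, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(d, 8, batch_first=True)

    def forward(self, text_tokens, audio_feats, video_feats):
        # Text queries attend to each modality separately, producing
        # text-conditioned audio and video representations.
        a, _ = self.text_to_audio(text_tokens, audio_feats, audio_feats)
        v, _ = self.text_to_video(text_tokens, video_feats, video_feats)
        return a.mean(dim=1), v.mean(dim=1)  # pooled embeddings for retrieval
```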
arXiv Detail & Related papers (2023-07-24T17:43:13Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
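Conceptually, the bridge is a denoiser conditioned on a pretrained image embedding of a video frame. The toy denoiser below only shows that conditioning interface; the real architecture and diffusion schedule are not specified here and everything below is a placeholder.

```python
# Conceptual sketch: a mel-spectrogram denoiser conditioned on a CLIP image
# embedding of a video frame, which is the bridge the summary describes.
import torch
import torch.nn as nn

class FrameConditionedDenoiser(nn.Module):
    def __init__(self, n_mels=80, d_cond=512, d_hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(d_cond, d_hidden)
        self.net = nn.Sequential(
            nn.Linear(n_mels + d_hidden + 1, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, n_mels),
        )

    def forward(self, noisy_mel, t, clip_embed):
        # noisy_mel: (B, T, n_mels); t: (B,) diffusion step; clip_embed: (B, 512)
        c = self.cond_proj(clip_embed).unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
        ts = t.float().view(-1, 1, 1).expand(-1, noisy_mel.size(1), 1)
        return self.net(torch.cat([noisy_mel, c, ts], dim=-1))  # predicted noise
```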
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
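The summary does not name the two loss functions, so the cosine-based consistency term below is purely illustrative of the trimodal-agreement idea: embeddings of the text description, the video, and the separated audio are pulled together.

```python
# Very loose sketch of a trimodal consistency objective; the paper's actual
# pseudo-target losses are not described in this summary.
import torch
import torch.nn.functional as F

def trimodal_consistency_loss(text_emb, video_emb, audio_emb):
    # All inputs: (B, D). Normalise and encourage pairwise agreement.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    return (1 - (t * v).sum(-1)).mean() \
         + (1 - (t * a).sum(-1)).mean() \
         + (1 - (v * a).sum(-1)).mean()
```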
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
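A skeleton of the summary's key component: a transformer decoder that autoregressively emits mel-spectrogram frames for the edited text span, conditioned on encoded context (the text plus surrounding, unedited speech). Layer sizes and the context encoding are illustrative assumptions.

```python
# Skeleton of a transformer-based mel decoder for speech editing; sizes and
# conditioning details are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class MelEditDecoder(nn.Module):
    def __init__(self, n_mels=80, d=256):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d)
        layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out_proj = nn.Linear(d, n_mels)

    def forward(self, prev_mel, context):
        # prev_mel: (B, T_out, n_mels) frames generated so far;
        # context: (B, T_ctx, d) encoded text + surrounding speech.
        x = self.in_proj(prev_mel)
        T = x.size(1)
        # Causal mask so each frame attends only to earlier frames.
        mask = torch.triu(torch.full((T, T), float('-inf'), device=x.device),
                          diagonal=1)
        h = self.decoder(x, context, tgt_mask=mask)
        return self.out_proj(h)  # next-frame mel predictions
```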
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Implicit Feature Alignment: Learn to Convert Text Recognizer to Text Spotter [38.4211220941874]
We propose a simple, elegant, and effective paradigm called Implicit Feature Alignment (IFA).
IFA can be easily integrated into current text recognizers, resulting in a novel inference mechanism called IFAinference.
We experimentally demonstrate that IFA achieves state-of-the-art performance on end-to-end document recognition tasks.
arXiv Detail & Related papers (2021-06-10T17:06:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.