Keyword Spotting Simplified: A Segmentation-Free Approach using
Character Counting and CTC re-scoring
- URL: http://arxiv.org/abs/2308.03515v1
- Date: Mon, 7 Aug 2023 12:11:04 GMT
- Title: Keyword Spotting Simplified: A Segmentation-Free Approach using
Character Counting and CTC re-scoring
- Authors: George Retsinas, Giorgos Sfikas, Christophoros Nikou
- Abstract summary: Recent advances in segmentation-free keyword spotting treat this problem as an object detection task.
We propose a novel segmentation-free system that efficiently scans a document image to find rectangular areas that include the query information.
- Score: 8.6134769826665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in segmentation-free keyword spotting treat this problem
within an object detection paradigm and borrow from state-of-the-art detection
systems to simultaneously propose a word bounding box proposal mechanism and
compute a corresponding representation. Contrary to the norm of such methods
that rely on complex and large DNN models, we propose a novel segmentation-free
system that efficiently scans a document image to find rectangular areas that
include the query information. The underlying model is simple and compact,
predicting character occurrences over rectangular areas through an implicitly
learned scale map, trained on word-level annotated images. The proposed
document scanning is then performed using this character counting in a
cost-effective manner via integral images and binary search. Finally, the
retrieval similarity by character counting is refined by a pyramidal
representation and a CTC-based re-scoring algorithm, fully utilizing the
trained CNN model. Experimental validation on two widely-used datasets shows
that our method achieves state-of-the-art results outperforming the more
complex alternatives, despite the simplicity of the underlying model.
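The document scan described in the abstract rests on two standard ingredients: an integral image, which turns arbitrary rectangular count queries into O(1) lookups, and a binary search over box edges, which is valid because counts grow monotonically as a rectangle widens. The sketch below illustrates the idea only; the density map, shapes, and helper names are assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical per-character density map of shape (H, W): each pixel holds
# the model's predicted fractional count for one character class. The array
# contents below are synthetic and purely illustrative.
H, W = 4, 8
density = np.zeros((H, W))
density[1:3, 2:5] = 0.5  # a synthetic "word" region contributing 3.0 counts

# Integral image: S[i, j] = sum of density[:i, :j], built once in O(H*W).
S = np.zeros((H + 1, W + 1))
S[1:, 1:] = density.cumsum(axis=0).cumsum(axis=1)

def rect_count(y0, x0, y1, x1):
    """Character count inside the rectangle [y0, y1) x [x0, x1), in O(1)."""
    return S[y1, x1] - S[y0, x1] - S[y1, x0] + S[y0, x0]

def min_right_edge(y0, y1, x0, target):
    """Binary-search the smallest right edge x such that [y0, y1) x [x0, x)
    holds at least `target` counts. Valid because the count is monotonically
    non-decreasing in x; returns W if the target is never reached."""
    lo, hi = x0, W
    while lo < hi:
        mid = (lo + hi) // 2
        if rect_count(y0, x0, y1, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

width_end = min_right_edge(0, H, 0, 3.0)  # smallest x covering all counts
```

Combining both pieces, a scan over candidate top-left corners can test each query's character histogram against rectangle counts without ever re-summing pixels.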
Related papers
- Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language-Image Pre-training (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
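For a rough picture of the similarity search such joint embeddings enable: once text and images live in a shared space, retrieval reduces to cosine similarity between unit-normalized vectors. The toy embeddings below are random stand-ins, not real CLIP outputs.

```python
import numpy as np

# Toy stand-ins for CLIP-style joint embeddings (real CLIP vectors are
# e.g. 512-d and come from trained encoders; these are illustrative).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 8))                   # 5 gallery images
query_emb = image_embs[3] + 0.01 * rng.normal(size=8)  # query near image 3

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity = dot product of unit-normalized embeddings.
gallery = l2_normalize(image_embs)
q = l2_normalize(query_emb)
scores = gallery @ q
ranking = np.argsort(-scores)  # best match first
```

Maintaining this alignment after fine-tuning is exactly what makes the same gallery index reusable across text and image queries.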
arXiv Detail & Related papers (2024-09-03T14:33:01Z)
- Fast and Scalable Semi-Supervised Learning for Multi-View Subspace Clustering [13.638434337947302]
FSSMSC is a novel solution to the high computational complexity commonly found in existing approaches.
The method generates a consensus anchor graph across all views, representing each data point as a sparse linear combination of chosen landmarks.
The effectiveness and efficiency of FSSMSC are validated through extensive experiments on multiple benchmark datasets of varying scales.
arXiv Detail & Related papers (2024-08-11T06:54:00Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
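Slerp itself has a simple closed form for unit vectors: interpolation proceeds along the great circle between the two embeddings at constant angular speed, so the merged query stays on the unit sphere. A minimal sketch, with toy 3-d vectors standing in for real image/text embeddings:

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between vectors u and v.
    t=0 returns (normalized) u, t=1 returns v; intermediate t moves
    along the connecting great circle at constant angular speed."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if np.isclose(omega, 0.0):   # nearly parallel: u is already the answer
        return u
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

# Merging an image embedding with a text embedding (toy vectors; in
# ZS-CIR these would come from a CLIP-style encoder pair):
img = np.array([1.0, 0.0, 0.0])
txt = np.array([0.0, 1.0, 0.0])
merged = slerp(img, txt, 0.5)  # balanced image/text composed query
```

The weight t acts as the image/text balance knob: small t keeps the query close to the reference image, large t pulls it toward the caption.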
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
- A Fixed-Point Approach to Unified Prompt-Based Counting [51.20608895374113]
This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for objects indicated by various prompt types, such as box, point, and text.
Our model excels in prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks.
arXiv Detail & Related papers (2024-03-15T12:05:44Z)
- Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
Segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
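The binarization step can be made differentiable by replacing the hard threshold with a steep sigmoid applied to the difference between the probability map and a learned threshold map (the DB paper uses an amplifying factor of k = 50). The toy maps below are illustrative:

```python
import numpy as np

def db_step(P, T, k=50.0):
    """Differentiable binarization: a steep sigmoid approximating the
    hard threshold step(P - T) while keeping gradients usable.
    P: probability map, T: (learned) threshold map, k: amplifying factor."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

# Toy 2x2 probability map against a uniform threshold of 0.5.
P = np.array([[0.9, 0.4],
              [0.6, 0.1]])
T = np.full_like(P, 0.5)
B = db_step(P, T)  # close to 1 where P > T, close to 0 where P < T
```

Because the step is smooth, the threshold map T can be trained jointly with the segmentation network instead of being tuned as a post-processing hyperparameter.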
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
- Finding Geometric Models by Clustering in the Consensus Space [61.65661010039768]
We propose a new algorithm for finding an unknown number of geometric models, e.g., homographies.
We present a number of applications where the use of multiple geometric models improves accuracy.
These include pose estimation from multiple generalized homographies and trajectory estimation of fast-moving objects.
arXiv Detail & Related papers (2021-03-25T14:35:07Z)
- Spatial-spectral Hyperspectral Image Classification via Multiple Random Anchor Graphs Ensemble Learning [88.60285937702304]
This paper proposes a novel spatial-spectral HSI classification method via multiple random anchor graphs ensemble learning (RAGE).
Firstly, the local binary pattern is adopted to extract the more descriptive features on each selected band, which preserves local structures and subtle changes of a region.
Secondly, the adaptive neighbors assignment is introduced in the construction of anchor graph, to reduce the computational complexity.
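The local binary pattern used for the band-wise feature extraction above compares each pixel's 3x3 neighbourhood against its centre and packs the comparisons into an 8-bit code. The sketch below uses one common clockwise packing convention; implementations vary.

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 local binary pattern: threshold the 8 neighbours of the
    centre pixel and pack the results clockwise into an 8-bit code."""
    c = patch[1, 1]
    # neighbours read clockwise starting from the top-left corner
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for bit, n in enumerate(neighbours):
        if n >= c:
            code |= 1 << bit
    return code

# Toy intensity patch; the resulting code describes the local texture
# around the centre pixel and is robust to monotonic illumination changes.
patch = np.array([[9, 1, 7],
                  [2, 5, 8],
                  [3, 6, 4]])
```

Histograms of these codes over a region give the descriptive, structure-preserving features the summary refers to.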
arXiv Detail & Related papers (2021-03-25T09:31:41Z)
- Image Retrieval for Structure-from-Motion via Graph Convolutional Network [13.040952255039702]
We present a novel retrieval method based on Graph Convolutional Network (GCN) to generate accurate pairwise matches without costly redundancy.
By constructing a subgraph surrounding the query image as input data, we adopt a learnable GCN to exploit whether nodes in the subgraph have overlapping regions with the query photograph.
Experiments demonstrate that our method performs remarkably well on the challenging dataset of highly ambiguous and duplicated scenes.
arXiv Detail & Related papers (2020-09-17T04:03:51Z)
- Predicting What You Already Know Helps: Provable Self-Supervised Learning [60.27658820909876]
Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks) without requiring labeled data.
We show a mechanism exploiting the statistical connections between certain reconstruction-based pretext tasks that guarantees learning a good representation.
We prove that a linear layer yields a small approximation error even for complex ground-truth function classes.
arXiv Detail & Related papers (2020-08-03T17:56:13Z)
- Contrast-weighted Dictionary Learning Based Saliency Detection for Remote Sensing Images [3.338193485961624]
We propose a novel saliency detection model based on Contrast-weighted Dictionary Learning (CDL) for remote sensing images.
Specifically, the proposed CDL learns salient and non-salient atoms from positive and negative samples to construct a discriminant dictionary.
By using the proposed joint saliency measure, a variety of saliency maps are generated based on the discriminant dictionary.
arXiv Detail & Related papers (2020-04-06T06:49:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.