A Fast Randomized Algorithm for Massive Text Normalization
- URL: http://arxiv.org/abs/2110.03024v1
- Date: Wed, 6 Oct 2021 19:18:17 GMT
- Title: A Fast Randomized Algorithm for Massive Text Normalization
- Authors: Nan Jiang, Chen Luo, Vihan Lakshman, Yesh Dattatreya, Yexiang Xue
- Abstract summary: We present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data.
Our algorithm relies on the Jaccard similarity between words to suggest correction results.
Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
- Score: 26.602776972067936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many popular machine learning techniques in natural language processing and
data mining rely heavily on high-quality text sources. However, real-world text
datasets contain a significant number of spelling errors and improperly
punctuated variants, on which the performance of these models quickly
deteriorates. Moreover, real-world, web-scale datasets contain hundreds of
millions or even billions of lines of text, over which existing text-cleaning
tools are prohibitively expensive to run and may require additional overhead
to learn the corrections. In this paper, we present FLAN, a scalable randomized
algorithm to clean and canonicalize massive text data. Our algorithm relies on
the Jaccard similarity between words to suggest correction results. We
efficiently handle the pairwise word-to-word comparisons via Locality Sensitive
Hashing (LSH). We also propose a novel stabilization process to address the
issue of hash collisions between dissimilar words, which is a consequence of
the randomized nature of LSH and is exacerbated by the massive scale of
real-world datasets. Compared with existing approaches, our method is more
efficient, both asymptotically and in empirical evaluations, and does not rely
on additional features, such as lexical/phonetic similarity or word embedding
features. In addition, FLAN does not require any annotated data or supervised
learning. We further theoretically show the robustness of our algorithm with
upper bounds on the false positive and false negative rates of corrections. Our
experimental results on real-world datasets demonstrate the efficiency and
efficacy of FLAN.
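The abstract's core mechanism can be sketched in a few lines: represent each word as a set of character n-grams, measure Jaccard similarity between those sets, and use MinHash-based Locality Sensitive Hashing so that similar words fall into the same buckets without pairwise comparison of every word pair. The n-gram size, number of hash functions, and banding scheme below are illustrative assumptions, not FLAN's actual parameters, and the paper's stabilization step against spurious collisions is omitted.

```python
# Minimal MinHash-LSH sketch for Jaccard-similar word bucketing.
import random
import zlib

def ngrams(word, n=3):
    """Set of character n-grams, with '#' padding so short words still map."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two n-gram sets."""
    return len(a & b) / len(a | b)

def minhash_signature(grams, seeds):
    """One MinHash value per seeded hash function (crc32 is deterministic,
    unlike Python's built-in hash of strings across processes)."""
    return tuple(
        min(zlib.crc32(f"{seed}:{g}".encode()) for g in grams) for seed in seeds
    )

def lsh_buckets(words, num_hashes=8, bands=4):
    """Group words whose signatures agree on at least one band of rows;
    co-bucketed words become candidate correction pairs."""
    seeds = random.Random(0).sample(range(1 << 30), num_hashes)
    rows = num_hashes // bands
    buckets = {}
    for w in words:
        sig = minhash_signature(ngrams(w), seeds)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(w)
    return buckets
```

Words sharing a bucket form a small candidate set on which exact Jaccard similarity can then be checked, avoiding the quadratic all-pairs comparison; higher Jaccard similarity raises the probability that two words agree on a band and collide.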
Related papers
- Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression [15.460141768587663]
We propose a lightweight supervised dictionary learning framework for text classification based on data compression and representation.
We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify the information-theoretic performance.
arXiv Detail & Related papers (2024-04-28T10:11:52Z)
- GuideWalk: A Novel Graph-Based Word Embedding for Enhanced Text Classification [0.0]
The processing of text data requires embedding, a method of translating the content of the text to numeric vectors.
A new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model is proposed.
The proposed method is tested with real-world data sets and eight well-known and successful embedding algorithms.
arXiv Detail & Related papers (2024-04-25T18:48:11Z)
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
- A Deep Learning Anomaly Detection Method in Textual Data [0.45687771576879593]
We propose using deep learning and transformer architectures combined with classical machine learning algorithms.
We used multiple machine learning methods such as Sentence Transformers, Autoencoders, Logistic Regression and distance-calculation methods to predict anomalies.
arXiv Detail & Related papers (2022-11-25T05:18:13Z)
- Simple Alternating Minimization Provably Solves Complete Dictionary Learning [13.056764072568749]
This paper focuses on the complete dictionary learning problem, where the goal is to reparametrize a set of given signals as linear combinations of atoms from a learned dictionary.
There are two main challenges faced by theoretical and practical dictionary learning: the lack of theoretical guarantees for practically-used algorithms, and poor scalability when dealing with huge-scale datasets.
arXiv Detail & Related papers (2022-10-23T18:30:45Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods for recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturbance-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
- ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection [147.10751375922035]
We propose ContourNet, which effectively handles false positives and the large scale variance of scene texts.
Our method effectively suppresses these false positives by only outputting predictions with high response value in both directions.
arXiv Detail & Related papers (2020-04-10T08:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.