A Fast Randomized Algorithm for Massive Text Normalization
- URL: http://arxiv.org/abs/2110.03024v1
- Date: Wed, 6 Oct 2021 19:18:17 GMT
- Title: A Fast Randomized Algorithm for Massive Text Normalization
- Authors: Nan Jiang, Chen Luo, Vihan Lakshman, Yesh Dattatreya, Yexiang Xue
- Abstract summary: We present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data.
Our algorithm relies on the Jaccard similarity between words to suggest correction results.
Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
- Score: 26.602776972067936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many popular machine learning techniques in natural language processing and
data mining rely heavily on high-quality text sources. However, real-world text
datasets contain a significant number of spelling errors and improperly
punctuated variants, on which the performance of these models quickly
deteriorates. Moreover, real-world, web-scale datasets contain hundreds of
millions or even billions of lines of text, over which existing text-cleaning
tools are prohibitively expensive to run and may require additional overhead
to learn the corrections. In this paper, we present FLAN, a scalable randomized
algorithm to clean and canonicalize massive text data. Our algorithm relies on
the Jaccard similarity between words to suggest correction results. We
efficiently handle the pairwise word-to-word comparisons via Locality Sensitive
Hashing (LSH). We also propose a novel stabilization process to address the
issue of hash collisions between dissimilar words, which is a consequence of
the randomized nature of LSH and is exacerbated by the massive scale of
real-world datasets. Compared with existing approaches, our method is more
efficient, both asymptotically and in empirical evaluations, and does not rely
on additional features, such as lexical/phonetic similarity or word embedding
features. In addition, FLAN does not require any annotated data or supervised
learning. We further theoretically show the robustness of our algorithm with
upper bounds on the false positive and false negative rates of corrections. Our
experimental results on real-world datasets demonstrate the efficiency and
efficacy of FLAN.
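The abstract's core mechanism can be sketched in a few lines: represent each word as a set of character n-grams, measure Jaccard similarity between those sets, and use MinHash-based Locality Sensitive Hashing so that similar words fall into the same buckets without pairwise comparison of every word pair. The n-gram size, number of hash functions, and banding scheme below are illustrative assumptions, not FLAN's actual parameters, and the paper's stabilization step against spurious collisions is omitted.

```python
# Minimal MinHash-LSH sketch for Jaccard-similar word bucketing.
import random
import zlib

def ngrams(word, n=3):
    """Set of character n-grams, with '#' padding so short words still map."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two n-gram sets."""
    return len(a & b) / len(a | b)

def minhash_signature(grams, seeds):
    """One MinHash value per seeded hash function (crc32 is deterministic,
    unlike Python's built-in hash of strings across processes)."""
    return tuple(
        min(zlib.crc32(f"{seed}:{g}".encode()) for g in grams) for seed in seeds
    )

def lsh_buckets(words, num_hashes=8, bands=4):
    """Group words whose signatures agree on at least one band of rows;
    co-bucketed words become candidate correction pairs."""
    seeds = random.Random(0).sample(range(1 << 30), num_hashes)
    rows = num_hashes // bands
    buckets = {}
    for w in words:
        sig = minhash_signature(ngrams(w), seeds)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(w)
    return buckets
```

Words sharing a bucket form a small candidate set on which exact Jaccard similarity can then be checked, avoiding the quadratic all-pairs comparison; higher Jaccard similarity raises the probability that two words agree on a band and collide.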
Related papers
- Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression [15.460141768587663]
We propose a lightweight supervised dictionary learning framework for text classification based on data compression and representation.
We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify the information-theoretic performance.
arXiv Detail & Related papers (2024-04-28T10:11:52Z)
- GuideWalk: A Novel Graph-Based Word Embedding for Enhanced Text Classification [0.0]
The processing of text data requires embedding, a method of translating the content of the text to numeric vectors.
A new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model is proposed.
The proposed method is tested with real-world data sets and eight well-known and successful embedding algorithms.
arXiv Detail & Related papers (2024-04-25T18:48:11Z)
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
- A Deep Learning Anomaly Detection Method in Textual Data [0.45687771576879593]
We propose using deep learning and transformer architectures combined with classical machine learning algorithms.
We used multiple machine learning methods such as Sentence Transformers, Autoencoders, Logistic Regression and distance-calculation methods to predict anomalies.
arXiv Detail & Related papers (2022-11-25T05:18:13Z)
- Simple Alternating Minimization Provably Solves Complete Dictionary Learning [13.056764072568749]
This paper focuses on the complete dictionary learning problem, where the goal is to reparametrize a set of given signals as linear combinations of atoms from a learned dictionary.
There are two main challenges faced by theoretical and practical dictionary learning: the lack of theoretical guarantees for practically-used algorithms, and poor scalability when dealing with huge-scale datasets.
arXiv Detail & Related papers (2022-10-23T18:30:45Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods for recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- CIMON: Towards High-quality Hash Codes [63.37321228830102]
We propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON).
First, we use global refinement and similarity statistical distribution to obtain reliable and smooth guidance. Second, both semantic and contrastive consistency learning are introduced to derive both disturbance-invariant and discriminative hash codes.
arXiv Detail & Related papers (2020-10-15T14:47:14Z)
- ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection [147.10751375922035]
We propose ContourNet, which effectively handles false positives and the large scale variance of scene texts.
Our method effectively suppresses these false positives by only outputting predictions with high response value in both directions.
arXiv Detail & Related papers (2020-04-10T08:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.