An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for
Information Retrieval and Stance Detection
- URL: http://arxiv.org/abs/2101.03303v1
- Date: Sat, 9 Jan 2021 06:57:09 GMT
- Title: An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for
Information Retrieval and Stance Detection
- Authors: Anurag Roy, Shalmoli Ghosh, Kripabandhu Ghosh, Saptarshi Ghosh
- Abstract summary: We propose an unsupervised algorithm for text normalization that does not need any training data / human intervention.
The proposed algorithm is applicable to text over different languages, and can handle both machine-generated and human-generated noise.
- Score: 4.20380265888641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large fraction of textual data available today contains various types of
'noise', such as OCR noise in digitized documents, noise due to informal
writing style of users on microblogging sites, and so on. To enable tasks such
as search/retrieval and classification over all the available data, we need
robust algorithms for text normalization, i.e., for cleaning different kinds of
noise in the text. There have been several efforts towards cleaning or
normalizing noisy text; however, many of the existing text normalization
methods are supervised and require language-dependent resources or large
amounts of training data that is difficult to obtain. We propose an
unsupervised algorithm for text normalization that does not need any training
data / human intervention. The proposed algorithm is applicable to text over
different languages, and can handle both machine-generated and human-generated
noise. Experiments over several standard datasets show that text normalization
through the proposed algorithm enables better retrieval and stance detection,
as compared to that using several baseline text normalization methods.
Implementation of our algorithm can be found at
https://github.com/ranarag/UnsupClean.
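The abstract describes the approach only at a high level; the full implementation is in the repository linked above. Purely as an illustration of the general idea of unsupervised normalization, and not of the authors' UnsupClean method itself, the sketch below groups noisy spelling variants by string similarity and rewrites each variant to the most frequent member of its group. The similarity measure, threshold, and toy data are assumptions made for the example.

```python
# Illustrative sketch only (NOT the UnsupClean algorithm): cluster noisy
# spelling variants by string similarity, without labels or training data,
# and rewrite each variant to the most frequent member of its cluster.
from collections import Counter
from difflib import SequenceMatcher

def build_normalization_map(tokens, threshold=0.75):
    """Unsupervised variant clustering; the threshold is an assumed value."""
    counts = Counter(tokens)
    vocab = sorted(counts, key=counts.get, reverse=True)  # most frequent first
    canonical = {}
    for word in vocab:
        if word in canonical:
            continue
        canonical[word] = word                 # frequent word heads its cluster
        for other in vocab:
            if other in canonical:
                continue
            if SequenceMatcher(None, word, other).ratio() >= threshold:
                canonical[other] = word        # noisy variant -> canonical form
    return canonical

def normalize(tokens, mapping):
    """Replace every token by the canonical form of its cluster."""
    return [mapping.get(t, t) for t in tokens]

noisy = "the meeting is tomorrow the meetng is 2morrow".split()
mapping = build_normalization_map(noisy)
print(normalize(noisy, mapping))
# -> ['the', 'meeting', 'is', 'tomorrow', 'the', 'meeting', 'is', 'tomorrow']
```

Because the grouping uses only token statistics and surface similarity, no labelled data or language-specific resources are required, which is the property the abstract emphasizes; the actual algorithm in the repository is, of course, more sophisticated.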
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z) - GuideWalk: A Novel Graph-Based Word Embedding for Enhanced Text Classification [0.0]
The processing of text data requires embeddings, i.e., mappings of text content to numeric vectors.
A new text embedding approach, namely the Guided Transition Probability Matrix (GTPM) model, is proposed.
The proposed method is evaluated on real-world datasets against eight well-known and successful embedding algorithms.
arXiv Detail & Related papers (2024-04-25T18:48:11Z) - On the performance of phonetic algorithms in microtext normalization [0.5755004576310332]
Microtext normalization is a preprocessing step for non-standard microtexts.
Phonetic algorithms can be used to transform microtexts into standard texts (a brief illustrative sketch of phonetic candidate generation appears after this list).
The aim of this study is to determine the best phonetic algorithms within the context of candidate generation.
arXiv Detail & Related papers (2024-02-04T19:54:44Z) - A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z) - A Deep Learning Anomaly Detection Method in Textual Data [0.45687771576879593]
We propose using deep learning and transformer architectures combined with classical machine learning algorithms.
We used multiple machine learning methods, such as Sentence Transformers, Autoencoders, Logistic Regression, and distance-calculation methods, to predict anomalies.
arXiv Detail & Related papers (2022-11-25T05:18:13Z) - Composable Text Controls in Latent Space with ODEs [97.12426987887021]
This paper proposes a new efficient approach for composable text operations in the compact latent space of text.
By connecting pretrained LMs to the latent space through efficient adaptation, we then decode the sampled vectors into the desired text sequences.
Experiments show that composing those operators within our approach manages to generate or edit high-quality text.
arXiv Detail & Related papers (2022-08-01T06:51:45Z) - Speaker Embedding-aware Neural Diarization for Flexible Number of
Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power-set-encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z) - A Fast Randomized Algorithm for Massive Text Normalization [26.602776972067936]
We present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data.
Our algorithm relies on the Jaccard similarity between words to suggest correction results (a brief illustrative sketch of this similarity appears after this list).
Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
arXiv Detail & Related papers (2021-10-06T19:18:17Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video-side modalities and show that we can effectively reduce the number of modalities used at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z) - Contextual Text Denoising with Masked Language Models [21.923035129334373]
We propose a new contextual text denoising algorithm based on the ready-to-use masked language model.
The proposed algorithm does not require retraining of the model and can be integrated into any NLP system.
arXiv Detail & Related papers (2019-10-30T18:47:37Z)
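As noted in the phonetic-algorithms entry above, phonetic keys can be used to generate normalization candidates for a noisy microtext token. The sketch below illustrates the idea with a simplified Soundex code; Soundex is only one of the phonetic algorithms such studies compare, and the toy dictionary, noisy token, and indexing scheme are assumptions made for the example.

```python
# Illustrative sketch: phonetic candidate generation with a simplified
# American Soundex code (first letter + three digits). The dictionary and
# noisy token below are toy examples, not data from the cited paper.
from collections import defaultdict

def soundex(word):
    """Simplified Soundex key of a word (empty string if it has no letters)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":            # 'h'/'w' do not separate equal codes
            prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]

def build_phonetic_index(vocabulary):
    """Group standard-vocabulary words under their phonetic key."""
    index = defaultdict(list)
    for word in vocabulary:
        index[soundex(word)].append(word)
    return index

def candidates(noisy_token, index):
    """Normalization candidates: vocabulary words that sound like the token."""
    return index.get(soundex(noisy_token), [])

vocab = ["tomorrow", "morning", "meeting", "message", "night"]  # toy dictionary
index = build_phonetic_index(vocab)
print(candidates("mornin", index))   # hypothetical noisy token -> ['morning']
```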
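The FLAN entry above states that correction results are suggested via the Jaccard similarity between words. FLAN itself is a randomized algorithm designed to scale to massive corpora; the sketch below shows only the exact, pairwise form of that similarity, under the assumption that words are represented as sets of character trigrams, with a made-up vocabulary and threshold.

```python
# Illustrative sketch: exact pairwise Jaccard similarity over character
# trigrams for suggesting corrections. The n-gram size, threshold, and
# vocabulary are assumptions, not details taken from the FLAN paper.
def char_ngrams(word, n=3):
    """Set of character n-grams, padded so short words still overlap."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(1, len(padded) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_corrections(noisy_word, vocabulary, threshold=0.4):
    """Rank in-vocabulary words by Jaccard similarity to the noisy word."""
    query = char_ngrams(noisy_word)
    scored = [(jaccard(query, char_ngrams(w)), w) for w in vocabulary]
    return [w for score, w in sorted(scored, reverse=True) if score >= threshold]

vocab = ["normalization", "information", "retrieval", "detection"]  # toy set
print(suggest_corrections("normalizatoin", vocab))  # hypothetical OCR-style typo
# -> ['normalization']
```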
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.