Normalizing Text using Language Modelling based on Phonetics and String
Similarity
- URL: http://arxiv.org/abs/2006.14116v1
- Date: Thu, 25 Jun 2020 00:42:39 GMT
- Title: Normalizing Text using Language Modelling based on Phonetics and String
Similarity
- Authors: Fenil Doshi, Jimit Gandhi, Deep Gosalia and Sudhir Bagul
- Abstract summary: We propose a new robust model to perform text normalization.
We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form.
Our strategies yield accuracies of 86.7% and 83.2%, which indicates the effectiveness of our system in dealing with text normalization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media networks and chatting platforms often use an informal version of
natural text. Adversarial spelling attacks also tend to alter the input text by
modifying the characters in the text. Normalizing these texts is an essential
step for various applications like language translation and text-to-speech
synthesis, where the models are trained on clean, standard English text. We
propose a new robust model to perform text normalization.
Our system uses the BERT language model to predict the masked words that
correspond to the unnormalized words. We propose two unique masking strategies
that try to replace the unnormalized words in the text with their root form
using a unique score based on phonetic and string similarity metrics. We use
human-centric evaluations where volunteers were asked to rank the normalized
text. Our strategies yield accuracies of 86.7% and 83.2%, which indicates the
effectiveness of our system in dealing with text normalization.
Related papers
- Historical German Text Normalization Using Type- and Token-Based Language Modeling [0.0]
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context.
An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable to that of a much larger, fully end-to-end sentence-based normalization system that fine-tunes a pre-trained Transformer large language model.
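A minimal sketch of the token-level step this entry describes: a causal language model rescoring type-level normalization candidates in context. GPT-2 stands in for the paper's German model, and the template and candidate list are hypothetical.

```python
# Sketch: pick the normalization candidate the causal LM prefers in context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def lm_logprob(sentence):
    """Approximate total log-likelihood of a sentence under the causal LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item() * ids.size(1)

def rerank(context_template, candidates):
    """Keep the type-level normalization the LM scores highest in context."""
    return max(candidates, key=lambda w: lm_logprob(context_template.format(w)))

# "hause"/"house" stand in for a historical spelling and its candidates
print(rerank("the old {} stood by the river", ["house", "hause"]))
```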
arXiv Detail & Related papers (2024-09-04T16:14:05Z)
- FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts [0.0]
This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks.
It is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language.
It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT on the Pearson and Spearman correlation criteria.
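These criteria compare model similarity scores against human ratings. A minimal scipy example with made-up numbers shows how such an evaluation is typically computed:

```python
from scipy.stats import pearsonr, spearmanr

# made-up numbers: gold human similarity ratings vs. model cosine similarities
gold = [4.5, 1.0, 3.2, 2.8, 0.5]
pred = [0.91, 0.12, 0.60, 0.55, 0.20]

print(pearsonr(gold, pred)[0])   # linear correlation of the raw scores
print(spearmanr(gold, pred)[0])  # correlation of the rank orderings
```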
arXiv Detail & Related papers (2024-07-27T05:04:49Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
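One way to read "replacing one-hot encoding" is as softening the training target with a corpus-derived prior distribution. The sketch below shows that general idea under assumed names and an illustrative mixing weight; it is not the paper's exact construction.

```python
# Sketch: soften a one-hot target with a distribution learned from a corpus.
import numpy as np

def soft_target(gold_index, prior, alpha=0.9):
    """Mix the gold one-hot label with a prior distribution (alpha is illustrative)."""
    target = (1.0 - alpha) * np.asarray(prior, dtype=float)  # new array, safe to mutate
    target[gold_index] += alpha
    return target

char_prior = np.full(26, 1.0 / 26.0)  # stand-in for character frequencies from a corpus
target = soft_target(gold_index=7, prior=char_prior)
print(target.sum())  # 1.0: still a valid distribution
```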
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- On the performance of phonetic algorithms in microtext normalization [0.5755004576310332]
Microtext normalization is a preprocessing step for non-standard microtexts.
Phonetic algorithms can be used to transform microtexts into standard texts.
The aim of this study is to determine the best phonetic algorithms within the context of candidate generation.
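A minimal sketch of phonetic candidate generation, assuming the jellyfish library and a toy lexicon; the study benchmarks several phonetic algorithms, whereas this sketch arbitrarily uses Metaphone.

```python
# Sketch: generate normalization candidates whose phonetic key matches.
import jellyfish

LEXICON = ["tomorrow", "tomato", "summer", "see", "sea"]  # toy standard vocabulary

def candidates(microtext_word, lexicon=LEXICON):
    """Return lexicon words sharing the noisy word's Metaphone key."""
    key = jellyfish.metaphone(microtext_word)
    return [w for w in lexicon if jellyfish.metaphone(w) == key]

print(candidates("tmrw"))  # e.g. ["tomorrow"], if the phonetic keys agree
```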
arXiv Detail & Related papers (2024-02-04T19:54:44Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- A Chat About Boring Problems: Studying GPT-based text normalization [22.64840464909988]
We show the capacity of Large Language Models for text normalization in few-shot scenarios.
We find LLM-based text normalization to achieve error rates around 40% lower than top normalization systems.
We create a new taxonomy of text normalization errors and apply it to results from GPT-3.5-Turbo and GPT-4.0.
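As a hedged sketch of what a few-shot normalization setup might look like, the snippet below assembles a prompt from in-context examples; the examples and wording are invented for illustration, not the paper's prompts.

```python
# Sketch: build a few-shot text-normalization prompt for a chat LLM.
FEW_SHOT = [
    ("i'll c u l8r", "I'll see you later."),
    ("thx 4 da help", "Thanks for the help."),
]

def build_prompt(noisy, examples=FEW_SHOT):
    """Assemble instruction + input/output demonstrations + the new input."""
    lines = ["Normalize the following text into standard written English."]
    for src, tgt in examples:
        lines.append(f"Input: {src}\nOutput: {tgt}")
    lines.append(f"Input: {noisy}\nOutput:")
    return "\n\n".join(lines)

print(build_prompt("gr8 news, c u tmrw"))
```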
arXiv Detail & Related papers (2023-09-23T16:32:59Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
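The element-wise Manhattan feature is simply the absolute difference |u - v| over a text/hypothesis embedding pair; a minimal numpy sketch with stand-in vectors:

```python
# Sketch: element-wise Manhattan distance feature for an entailment classifier.
import numpy as np

def manhattan_feature(u, v):
    """Element-wise absolute difference between two sentence embeddings."""
    return np.abs(np.asarray(u) - np.asarray(v))

text_vec = np.array([0.2, 0.7, 0.1])  # stand-in text embedding
hyp_vec = np.array([0.3, 0.6, 0.4])   # stand-in hypothesis embedding
print(manhattan_feature(text_vec, hyp_vec))  # fed to a classifier downstream
```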
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- Neural semi-Markov CRF for Monolingual Word Alignment [20.897157172049877]
We present a novel neural semi-Markov CRF alignment model, which unifies word and phrase alignments through variable-length spans.
We also create a new benchmark with human annotations that cover four different text genres to evaluate monolingual word alignment models.
arXiv Detail & Related papers (2021-06-04T16:04:00Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
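As background, entropy-regularized optimal transport between two sets of token representations can be computed with Sinkhorn iterations. The numpy sketch below is a generic illustration of such an OT coupling, not the paper's training objective.

```python
# Sketch: Sinkhorn iterations for entropy-regularized OT with uniform marginals.
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Return a transport plan coupling rows to columns of a cost matrix."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
cost = rng.random((4, 5))  # e.g. distances between generated and reference embeddings
plan = sinkhorn(cost)
print(plan.sum())  # ~1.0: a valid coupling of the two marginals
```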
arXiv Detail & Related papers (2020-10-12T19:42:25Z)