Normalizing Text using Language Modelling based on Phonetics and String
Similarity
- URL: http://arxiv.org/abs/2006.14116v1
- Date: Thu, 25 Jun 2020 00:42:39 GMT
- Authors: Fenil Doshi, Jimit Gandhi, Deep Gosalia and Sudhir Bagul
- Abstract summary: We propose a new robust model to perform text normalization.
We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form.
Our strategies yield accuracies of 86.7% and 83.2%, which indicates the effectiveness of our system in dealing with text normalization.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media networks and chatting platforms often use an informal version of
natural text. Adversarial spelling attacks also tend to alter the input text by
modifying the characters in the text. Normalizing these texts is an essential
step for various applications like language translation and text to speech
synthesis where the models are trained over clean regular English language. We
propose a new robust model to perform text normalization.
Our system uses the BERT language model to predict the masked words that
correspond to the unnormalized words. We propose two unique masking strategies
that try to replace the unnormalized words in the text with their root form
using a unique score based on phonetic and string similarity metrics. We use
human-centric evaluations where volunteers were asked to rank the normalized
text. Our strategies yield accuracies of 86.7% and 83.2%, which indicates the
effectiveness of our system in dealing with text normalization.
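The abstract names a score that combines phonetic and string similarity but does not state its formula, so the following Python sketch is only a minimal illustration under that assumption: a hand-rolled Soundex encoding for the phonetic part and `difflib.SequenceMatcher` for the string part. The weights, helper names, and candidate list are hypothetical, not the paper's.

```python
from difflib import SequenceMatcher

def soundex(word: str) -> str:
    """Minimal Soundex: first letter plus up to three consonant digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        prev = digit
    return (encoded + "000")[:4]

def combined_score(candidate: str, noisy: str,
                   w_phon: float = 0.5, w_str: float = 0.5) -> float:
    """Blend phonetic agreement with a character-level similarity ratio."""
    phonetic = 1.0 if soundex(candidate) == soundex(noisy) else 0.0
    string_sim = SequenceMatcher(None, candidate, noisy).ratio()
    return w_phon * phonetic + w_str * string_sim

# Rank candidate fillers for a masked slot against the noisy token "tmrw".
candidates = ["tomorrow", "today", "tomato"]
best = max(candidates, key=lambda c: combined_score(c, "tmrw"))
```

In the paper's pipeline the candidates would come from BERT's predictions for the masked position; in this toy example `best` resolves to "tomorrow", since it both matches "tmrw" phonetically and shares the most characters.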
Related papers
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- On the performance of phonetic algorithms in microtext normalization [0.5755004576310332]
Microtext normalization is a preprocessing step for non-standard microtexts.
Phonetic algorithms can be used to transform microtexts into standard texts.
The aim of this study is to determine the best phonetic algorithms within the context of candidate generation.
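Candidate generation with a phonetic key can be sketched as indexing a lexicon by phonetic code and looking up the noisy token's code. The sketch below uses a deliberately crude consonant-skeleton key as a stand-in for real phonetic algorithms such as Soundex or Metaphone; the lexicon and function names are illustrative only.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def consonant_key(word: str) -> str:
    """Crude phonetic stand-in: keep consonants, drop vowels and repeats."""
    key = ""
    for ch in word.lower():
        if ch.isalpha() and ch not in VOWELS and (not key or key[-1] != ch):
            key += ch
    return key

def build_index(lexicon):
    """Group lexicon entries under their shared phonetic key."""
    index = defaultdict(set)
    for w in lexicon:
        index[consonant_key(w)].add(w)
    return index

def candidates(token, index):
    """All lexicon words whose key matches the noisy token's key."""
    return sorted(index.get(consonant_key(token), set()))

lexicon = ["tomorrow", "today", "thanks", "please", "text"]
index = build_index(lexicon)
# "tmrw" shares the consonant skeleton of "tomorrow", so it is retrieved.
```

A real system would layer ranking (frequency, edit distance) on top of this lookup, since a single phonetic key can cover many words.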
arXiv Detail & Related papers (2024-02-04T19:54:44Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- A Chat About Boring Problems: Studying GPT-based text normalization [22.64840464909988]
We show the capacity of large language models for text normalization in few-shot scenarios.
We find LLM-based text normalization to achieve error rates around 40% lower than top normalization systems.
We create a new taxonomy of text normalization errors and apply it to results from GPT-3.5-Turbo and GPT-4.0.
arXiv Detail & Related papers (2023-09-23T16:32:59Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
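The segment-copying formulation can be illustrated with a toy heuristic that is not the paper's actual retrieval mechanism: locate the context's last token in a tokenized collection and copy the span that follows its first occurrence. The corpus and span length below are illustrative assumptions.

```python
def copy_next_segment(context, corpus_tokens, seg_len=3):
    """Toy segment copying: find the context's last token in the corpus
    and copy the segment that follows its first occurrence."""
    last = context[-1]
    for i, tok in enumerate(corpus_tokens[:-1]):
        if tok == last:
            return corpus_tokens[i + 1 : i + 1 + seg_len]
    return []

corpus = "the cat sat on the mat".split()
# Continuing the context ["the"] copies the segment after the first "the".
segment = copy_next_segment(["the"], corpus)
```

The actual paper retrieves segments with learned phrase representations rather than exact token matching; this sketch only conveys the copy-instead-of-generate idea.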
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
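The element-wise Manhattan distance feature can be sketched in a few lines of plain Python: the absolute difference per dimension between the two sentence embeddings, which a downstream classifier then consumes. The toy embedding values below are illustrative; in practice the vectors would come from a sentence encoder.

```python
def manhattan_feature(text_vec, hyp_vec):
    """Element-wise absolute difference between two sentence embeddings."""
    if len(text_vec) != len(hyp_vec):
        raise ValueError("embeddings must have the same dimensionality")
    return [abs(t - h) for t, h in zip(text_vec, hyp_vec)]

# Toy 4-dimensional sentence embeddings (illustrative values only).
text_vec = [0.2, 0.8, 0.5, 0.1]
hyp_vec = [0.1, 0.9, 0.5, 0.4]
feature = manhattan_feature(text_vec, hyp_vec)  # ~[0.1, 0.1, 0.0, 0.3]
```

Note that, unlike a single scalar distance, keeping the vector of per-dimension differences preserves which dimensions disagree, which is what makes it usable as a classifier feature.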
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- Neural semi-Markov CRF for Monolingual Word Alignment [20.897157172049877]
We present a novel neural semi-Markov CRF alignment model, which unifies word and phrase alignments through variable-length spans.
We also create a new benchmark with human annotations that cover four different text genres to evaluate monolingual word alignment models.
arXiv Detail & Related papers (2021-06-04T16:04:00Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages with Abugida Scripts [0.0]
Abugida refers to a phonogram writing system where each syllable is represented using a single consonant or typographic ligature.
We propose a disambiguation algorithm and showcase its usefulness in two novel input methods for languages using the abugida writing system.
We show an improvement in typing speed by 19.49%, 25.13%, and 14.89%, in Hindi, Bengali, and Thai, respectively, using Ambiguous Input.
arXiv Detail & Related papers (2021-01-05T03:16:34Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.