On the performance of phonetic algorithms in microtext normalization
- URL: http://arxiv.org/abs/2402.02591v1
- Date: Sun, 4 Feb 2024 19:54:44 GMT
- Title: On the performance of phonetic algorithms in microtext normalization
- Authors: Yerai Doval, Manuel Vilares, Jesús Vilares
- Abstract summary: Microtext normalization is a preprocessing step that converts non-standard microtexts into standard text.
Phonetic algorithms can be used to generate normalization candidates for non-standard terms.
The aim of this study is to determine the best phonetic algorithms within the context of candidate generation.
- Score: 0.5755004576310332
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: User-generated content published on microblogging social networks constitutes
a priceless source of information. However, microtexts usually deviate from the
standard lexical and grammatical rules of the language, which makes their
processing by traditional intelligent systems very difficult. In response,
microtext normalization consists in transforming those non-standard microtexts
into standard well-written texts as a preprocessing step, allowing traditional
approaches to continue with their usual processing. Given the importance of
phonetic phenomena in non-standard text formation, an essential element of the
knowledge base of a normalizer would be the phonetic rules that encode these
phenomena, which can be found in the so-called phonetic algorithms.
In this work we experiment with a wide range of phonetic algorithms for the
English language. The aim of this study is to determine the best phonetic
algorithms within the context of candidate generation for microtext
normalization. In other words, we intend to find those algorithms that, taking
as input the non-standard terms to be normalized, yield as output the
smallest possible sets of normalization candidates which still contain the
corresponding target standard words. As will be discussed, the choice of the
phonetic algorithm will depend heavily on the capabilities of the candidate
selection mechanism which we usually find at the end of a microtext
normalization pipeline. The faster it can make the right choices among big
enough sets of candidates, the more we can sacrifice on the precision of the
phonetic algorithms in favour of coverage in order to increase the overall
performance of the normalization system.
KEYWORDS: microtext normalization; phonetic algorithm; fuzzy matching;
Twitter; texting
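The candidate generation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' setup: the abstract does not name the algorithms evaluated, so the choice of classic American Soundex is an assumption, as are the toy lexicon and the helper names `soundex`, `build_index` and `candidates`.

```python
from collections import defaultdict

# Soundex consonant classes (classic American Soundex, chosen here as an
# assumption; the abstract does not say which algorithms were tested).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]  # digits etc. are dropped
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")  # vowels map to "" and reset the run
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]


def build_index(lexicon):
    """Group standard words by phonetic key (hypothetical helper)."""
    index = defaultdict(set)
    for word in lexicon:
        index[soundex(word)].add(word)
    return index


def candidates(term, index):
    """Return all standard words sharing the term's phonetic key."""
    return index.get(soundex(term), set())


# Toy lexicon, for illustration only.
lexicon = ["tomorrow", "timer", "please", "pills",
           "friend", "front", "because", "night"]
index = build_index(lexicon)

for term, target in [("tmrw", "tomorrow"), ("pls", "please"),
                     ("frend", "friend"), ("nite", "night")]:
    cands = candidates(term, index)
    # Smaller sets are more precise; a set without the target is a coverage miss.
    print(term, sorted(cands), "target found:", target in cands)
```

With this toy lexicon, `tmrw` and `pls` each retrieve a two-word set that contains the target, while `nite` retrieves nothing because Soundex encodes the `g` in `night`. This is the trade-off the abstract describes: a coarser key would recover more targets at the cost of larger candidate sets for the selection stage to sort through.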
Related papers
- Phonetically rich corpus construction for a low-resourced language [0.0]
This paper proposes a novel approach to create a corpus with broad phonetic coverage for a low-resourced language.
Our methodology spans the pipeline from text dataset collection to a sentence selection algorithm based on triphone distribution.
Using our algorithm, we achieve a 55.8% higher percentage of distinct triphones -- for samples of similar size.
arXiv Detail & Related papers (2024-02-08T16:36:11Z)
- Looking and Listening: Audio Guided Text Recognition [62.98768236858089]
Text recognition in the wild is a long-standing problem in computer vision.
Recent studies suggest vision and language processing are effective for scene text recognition.
Yet, solving edit errors such as add, delete, or replace is still the main challenge for existing approaches.
We propose AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction.
arXiv Detail & Related papers (2023-06-06T08:08:18Z)
- An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer [37.0774363352316]
We propose an end-to-end Chinese text normalization model, which accepts Chinese characters as direct input.
We also release the first publicly accessible large-scale dataset for Chinese text normalization.
arXiv Detail & Related papers (2022-03-31T11:19:53Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
- Determinantal Beam Search [75.84501052642361]
Beam search is a go-to strategy for decoding neural sequence models.
In use-cases that call for multiple solutions, a diverse or representative set is often desired.
By posing iterations in beam search as a series of subdeterminant problems, we can turn the algorithm into a diverse subset selection process.
arXiv Detail & Related papers (2021-06-14T13:01:46Z)
- Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching [66.71886789848472]
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem.
The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information.
Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
arXiv Detail & Related papers (2021-01-16T10:34:03Z)
- An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection [4.20380265888641]
We propose an unsupervised algorithm for text normalization that needs neither training data nor human intervention.
The proposed algorithm is applicable to text over different languages, and can handle both machine-generated and human-generated noise.
arXiv Detail & Related papers (2021-01-09T06:57:09Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- Normalizing Text using Language Modelling based on Phonetics and String Similarity [0.0]
We propose a new robust model to perform text normalization.
We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form.
Our strategies yield accuracies of 86.7% and 83.2%, which indicates the effectiveness of our system in dealing with text normalization.
arXiv Detail & Related papers (2020-06-25T00:42:39Z)
- Investigating Label Bias in Beam Search for Open-ended Text Generation [8.331919991368366]
In open-ended text generation, beam search is often found to produce repetitive and generic texts.
Standard seq2seq models suffer from label bias due to their locally normalized probability formulation.
By combining locally normalized maximum likelihood estimation and globally normalized sequence-level training, label bias can be reduced with almost no sacrifice in perplexity.
arXiv Detail & Related papers (2020-05-22T05:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.