Related papers: On the performance of phonetic algorithms in microtext normalization

URL: http://arxiv.org/abs/2402.02591v1
Date: Sun, 4 Feb 2024 19:54:44 GMT
Title: On the performance of phonetic algorithms in microtext normalization
Authors: Yerai Doval, Manuel Vilares, Jesús Vilares
Abstract summary: Microtext normalization is a preprocessing step for non-standard microtexts. Phonetic algorithms can be used to transform microtexts into standard texts. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation.
Score: 0.5755004576310332
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, thus making their processing by traditional intelligent systems very difficult. As an answer, microtext normalization consists in transforming those non-standard microtexts into standard well-written texts as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in non-standard text formation, an essential element of the knowledge base of a normalizer would be the phonetic rules that encode these phenomena, which can be found in the so-called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that, taking as input non-standard terms to be normalized, allow us to obtain as output the smallest possible sets of normalization candidates which still contain the corresponding target standard words. As will be discussed, the choice of the phonetic algorithm will depend heavily on the capabilities of the candidate selection mechanism which we usually find at the end of a microtext normalization pipeline. The faster it can make the right choices among big enough sets of candidates, the more we can sacrifice on the precision of the phonetic algorithms in favour of coverage in order to increase the overall performance of the normalization system.
Keywords: microtext normalization; phonetic algorithm; fuzzy matching; Twitter; texting
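The candidate generation step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the phonetic algorithms evaluated, so the example below assumes classic American Soundex purely as a stand-in; the function names `soundex`, `build_index` and `candidates`, and the toy lexicon, are hypothetical and not taken from the paper. The idea is to index a lexicon of standard words by phonetic key, then look up the key of a non-standard term to obtain its candidate set.

```python
from collections import defaultdict

# Soundex digit classes (assumption: classic American Soundex, used here
# only as an illustrative phonetic algorithm).
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex key of an ASCII word ("" if no letters)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset the run, so a repeated code is kept
    return (key + "000")[:4]


def build_index(lexicon):
    """Group standard words by phonetic key (hypothetical helper)."""
    index = defaultdict(set)
    for w in lexicon:
        index[soundex(w)].add(w)
    return index


def candidates(term, index):
    """Normalization candidates: standard words sharing the term's key."""
    return set(index.get(soundex(term), set()))


# Toy lexicon, chosen only for illustration.
index = build_index(["tomorrow", "tomato", "because", "thanks", "think", "night"])
print(candidates("tomoro", index))
print(candidates("thanx", index))
```

In this toy setting "thanx" yields a candidate set containing both "thanks" and "think", which shows the trade-off the abstract points to: a coarser key widens coverage but leaves more work for the downstream candidate selection step.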

Related papers

Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition [56.972851337263755]
We propose a method which allows corrections of substitution errors to improve the recognition accuracy of challenging words. We show that with this method we get a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
arXiv Detail & Related papers (2025-06-23T14:42:03Z)
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation [13.70446799743065]
Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding maps each character to specific byte(s) eliminating the emergence of unknown words, even in new languages. Local contextualization has proven effective in assigning initial semantics to tokens, improving sentence comprehension. We propose Mixture of Contextualization Experts (MoCE), adaptively selecting and mixing attention heads, which are treated as contextualization experts.
arXiv Detail & Related papers (2024-11-03T08:15:43Z)
Phonetically rich corpus construction for a low-resourced language [0.0]
This paper proposes a novel approach to create a corpus with broad phonetic coverage for a low-resourced language. Our methodology includes text dataset collection up to a sentence selection algorithm based on triphone distribution. Using our algorithm, we achieve a 55.8% higher percentage of distinct triphones for samples of similar size.
arXiv Detail & Related papers (2024-02-08T16:36:11Z)
Looking and Listening: Audio Guided Text Recognition [62.98768236858089]
Text recognition in the wild is a long-standing problem in computer vision. Recent studies suggest vision and language processing are effective for scene text recognition. Yet, solving edit errors such as add, delete, or replace is still the main challenge for existing approaches. We propose the AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction.
arXiv Detail & Related papers (2023-06-06T08:08:18Z)
An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer [37.0774363352316]
We propose an end-to-end Chinese text normalization model, which accepts Chinese characters as direct input. We also release a first publicly accessible large-scale dataset for Chinese text normalization.
arXiv Detail & Related papers (2022-03-31T11:19:53Z)
Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels. Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms. For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime. In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem. We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
Determinantal Beam Search [75.84501052642361]
Beam search is a go-to strategy for decoding neural sequence models. In use-cases that call for multiple solutions, a diverse or representative set is often desired. By posing iterations in beam search as a series of subdeterminant problems, we can turn the algorithm into a diverse subset selection process.
arXiv Detail & Related papers (2021-06-14T13:01:46Z)
Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching [66.71886789848472]
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem. The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information. Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
arXiv Detail & Related papers (2021-01-16T10:34:03Z)
An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection [4.20380265888641]
We propose an unsupervised algorithm for text normalization that does not need any training data / human intervention. The proposed algorithm is applicable to text over different languages, and can handle both machine-generated and human-generated noise.
arXiv Detail & Related papers (2021-01-09T06:57:09Z)
Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
Normalizing Text using Language Modelling based on Phonetics and String Similarity [0.0]
We propose a new robust model to perform text normalization. We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form. Our strategies yield an accuracy of 86.7% and 83.2% which indicates the effectiveness of our system in dealing with text normalization.
arXiv Detail & Related papers (2020-06-25T00:42:39Z)
Investigating Label Bias in Beam Search for Open-ended Text Generation [8.331919991368366]
In open-ended text generation, beam search is often found to produce repetitive and generic texts. Standard seq2seq models suffer from label bias due to their locally normalized probability formulation. By combining locally normalized maximum likelihood estimation and globally normalized sequence-level training, label bias can be reduced with almost no sacrifice in perplexity.
arXiv Detail & Related papers (2020-05-22T05:17:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.