Related papers: Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results

URL: http://arxiv.org/abs/2507.01764v1
Date: Wed, 02 Jul 2025 14:46:26 GMT
Title: Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results
Authors: Matteo Di Cristofaro
Abstract summary: This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings. The research presents methods for ensuring that digital texts are accurately represented in corpora.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and technical aspects involved in digital textual data to enhance the accuracy of corpus analysis, and have significant implications for both quantitative and qualitative approaches in corpus-based research.
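To make the fidelity problem in the abstract concrete, the following minimal sketch shows how a single homoglyph splits one word type into two in a frequency count, and how an emoji with a skin-tone modifier occupies more than one code point. It is an illustration only, not the paper's method: the sample text, the whitespace tokenisation, and the helper names `scripts_in` and `mixed_script_tokens` are assumptions introduced here.

```python
import unicodedata
from collections import Counter


def scripts_in(token):
    """Return the first word of each letter's Unicode name (e.g. LATIN, CYRILLIC).

    Hypothetical helper: a rough stand-in for a real script lookup.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    }


def mixed_script_tokens(tokens):
    """Return the tokens whose letters come from more than one script."""
    return [t for t in tokens if len(scripts_in(t)) > 1]


# The second word uses CYRILLIC SMALL LETTER A (U+0430) in place of Latin "a".
text = "paypal p\u0430ypal paypal"
tokens = text.split()  # naive whitespace tokenisation, assumed for illustration

# The two spellings look identical on screen but are counted as different types.
counts = Counter(tokens)

# Flag tokens that mix scripts as candidates for manual inspection.
suspects = mixed_script_tokens(tokens)

# THUMBS UP SIGN followed by a skin-tone modifier: one visible symbol,
# two code points, so a per-code-point tokeniser would split it.
emoji = "\U0001F44D\U0001F3FD"
emoji_length = len(emoji)
```

Here `counts` records the Latin spelling twice and the mixed-script spelling once, so any frequency or collocation measure built on these tokens would treat them as unrelated words unless the homoglyph is normalised first.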

Related papers

Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study [4.417564179511245]
This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations. Syntactic and grammatical features retain strong discriminative power even in the absence of lexical content. This study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.
arXiv Detail & Related papers (2026-02-11T16:53:57Z)
Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis [0.5545791216381869]
We explore how agentic large language models (LLMs) can streamline the systematic analysis of annotated corpora. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation. We test the system on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS).
arXiv Detail & Related papers (2025-11-28T21:27:58Z)
Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings [2.1494179586067537]
This research introduces a novel psychometric method for analyzing textual data using large language models. By leveraging contextual embeddings, we transform textual data into response data suitable for psychometric analysis.
arXiv Detail & Related papers (2025-09-10T18:31:37Z)
Combining Objective and Subjective Perspectives for Political News Understanding [5.741243797283764]
We introduce a text analysis framework which integrates both perspectives and provides a fine-grained processing of subjective aspects. We illustrate its functioning with insights on news outlets, political orientations, topics, individual entities, and demographic segments.
arXiv Detail & Related papers (2024-08-20T20:13:19Z)
Qualitative Data Analysis in Software Engineering: Techniques and Teaching Insights [10.222207222039048]
Software repositories are rich sources of qualitative artifacts, including source code comments, commit messages, issue descriptions, and documentation. This chapter shifts the focus towards interpreting these artifacts using various qualitative data analysis techniques. Various coding methods are discussed along with the strategic design of a coding guide to ensure consistency and accuracy in data interpretation.
arXiv Detail & Related papers (2024-06-12T13:56:55Z)
Capturing Pertinent Symbolic Features for Enhanced Content-Based Misinformation Detection [0.0]
The detection of misleading content presents a significant hurdle due to its extreme linguistic and domain variability. This paper analyzes the linguistic attributes that characterize this phenomenon and how representative of such features some of the most popular misinformation datasets are. We demonstrate that the appropriate use of pertinent symbolic knowledge in combination with neural language models is helpful in detecting misleading content.
arXiv Detail & Related papers (2024-01-29T16:42:34Z)
Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects. In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts. We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
Natural Language Decompositions of Implicit Content Enable Better Text Representations [52.992875653864076]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account. We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed. Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP. We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
An Informational Space Based Semantic Analysis for Scientific Texts [62.997667081978825]
This paper introduces computational methods for semantic analysis and for quantifying the meaning of short scientific texts. The representation of scientific-specific meaning is standardised by replacing the situation representations, rather than psychological properties. The research in this paper lays the foundation for the geometric representation of the meaning of texts.
arXiv Detail & Related papers (2022-05-31T11:19:32Z)
A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
Natural language technology and query expansion: issues, state-of-the-art and perspectives [0.0]
Linguistic characteristics that cause ambiguity and misinterpretation of queries, as well as additional factors, affect users' ability to accurately represent their information needs. We lay down the anatomy of a generic linguistic-based query expansion framework and propose its module-based decomposition. For each of the modules we review the state-of-the-art solutions in the literature and categorize them in light of the techniques used.
arXiv Detail & Related papers (2020-04-23T11:39:07Z)
A Framework for Evaluation of Machine Reading Comprehension Gold Standards [7.6250852763032375]
This paper proposes a unifying framework to investigate the present linguistic features, required reasoning and background knowledge and factual correctness. The absence of features that contribute towards lexical ambiguity, the varying factual correctness of the expected answers and the presence of lexical cues, all of which potentially lower the reading comprehension complexity and quality of the evaluation data.
arXiv Detail & Related papers (2020-03-10T11:30:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.