Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling
- URL: http://arxiv.org/abs/2404.18510v2
- Date: Mon, 1 Jul 2024 15:58:11 GMT
- Title: Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling
- Authors: Dana Roemling, Yves Scherrer, Aleksandra Miletic
- Abstract summary: We explore the explainability of machine learning approaches considering the forensic context.
We focus on variety classification as a means of geolinguistic profiling of unknown texts based on social media data from the German-speaking area.
We find that the extracted lexical features are indeed representative of their respective varieties and note that the trained models also rely on place names for classifications.
- Score: 46.58131072375399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Forensic authorship profiling uses linguistic markers to infer characteristics about an author of a text. This task is paralleled in dialect classification, where a prediction is made about the linguistic variety of a text based on the text itself. While there have been significant advances in recent years in variety classification, forensic linguistics rarely relies on these approaches due to their lack of transparency, among other reasons. In this paper we therefore explore the explainability of machine learning approaches considering the forensic context. We focus on variety classification as a means of geolinguistic profiling of unknown texts based on social media data from the German-speaking area. For this, we identify the lexical items that are the most impactful for the variety classification. We find that the extracted lexical features are indeed representative of their respective varieties and note that the trained models also rely on place names for classifications.
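The abstract's idea of identifying the lexical items that matter most for each variety can be illustrated with a small sketch. This is not the authors' method, which the abstract does not specify: it swaps in a smoothed log-odds ratio over word counts as a transparent stand-in. The function name `top_lexical_markers`, the variety labels, and the toy sentences are all hypothetical and invented for illustration.

```python
import math
from collections import Counter


def top_lexical_markers(texts_by_variety, k=3, alpha=1.0):
    """Rank words per variety by a smoothed log-odds ratio (illustrative only)."""
    # Word counts per variety, using naive lowercased whitespace tokenisation.
    counts = {
        variety: Counter(w for text in texts for w in text.lower().split())
        for variety, texts in texts_by_variety.items()
    }
    vocab = set().union(*counts.values())
    markers = {}
    for variety, own in counts.items():
        # Pool the counts of every other variety as the comparison set.
        rest = Counter()
        for other, c in counts.items():
            if other != variety:
                rest.update(c)
        own_total = sum(own.values()) + alpha * len(vocab)
        rest_total = sum(rest.values()) + alpha * len(vocab)
        # Additive smoothing (alpha) keeps unseen words from producing log(0).
        scores = {
            w: math.log((own[w] + alpha) / own_total)
            - math.log((rest[w] + alpha) / rest_total)
            for w in vocab
        }
        # Highest score first; ties are broken alphabetically for determinism.
        markers[variety] = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    return markers


# Invented toy data, not taken from the paper's corpus.
toy = {
    "CH": ["grüezi mitenand in zürich", "das velo steht in zürich"],
    "AT": ["servus aus wien", "im jänner in wien"],
}
markers = top_lexical_markers(toy)
```

On this toy input the top-ranked item for each variety is a place name, which mirrors the abstract's caveat that models may lean on place names rather than on dialect vocabulary alone.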
Related papers
- Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach [4.161155428666988]
Stylometry aims to distinguish authors by analyzing literary traits assumed to reflect semi-conscious choices distinct from elements like genre or theme.
While some literary properties, such as thematic content, are likely to manifest as correlations between adjacent text units, others, like authorial style, may be independent thereof.
We introduce a hypothesis-testing approach to evaluate the influence of sequentially correlated literary properties on text classification.
arXiv Detail & Related papers (2024-11-07T18:28:40Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers [43.756851270091516]
We present a novel approach to extract distinguishing lexical features of dialects by utilizing interpretable dialect classifiers.
We experimentally demonstrate that our method successfully identifies key language-specific lexical features that contribute to dialectal variations.
arXiv Detail & Related papers (2024-02-27T22:06:55Z)
- Classifying text using machine learning models and determining conversation drift [4.785406121053965]
An analysis of various types of texts is invaluable to understanding both their semantic meaning and their relevance.
Text classification is a method of categorising documents.
It combines computer text classification and natural language processing to analyse text in aggregate.
arXiv Detail & Related papers (2022-11-15T18:09:45Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Semantic Analysis for Automated Evaluation of the Potential Impact of Research Articles [62.997667081978825]
This paper presents a novel method for vector representation of text meaning based on information theory.
We show how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus.
We show that an informational approach to representing the meaning of a text has offered a way to effectively predict the scientific impact of research papers.
arXiv Detail & Related papers (2021-04-26T20:37:13Z)
- Linguistic Profiling of a Neural Language Model [1.0552465253379135]
We investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after a fine-tuning process.
We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks.
arXiv Detail & Related papers (2020-10-05T09:09:01Z)
- Comparative Analysis of Text Classification Approaches in Electronic Health Records [0.6229951975208341]
We analyse the impact of various word representations, text pre-processing and classification algorithms on the performance of four different text classification tasks.
Results show that traditional approaches, when tailored to the specific language and structure of the text inherent to the classification task, can achieve or exceed the performance of more recent ones.
arXiv Detail & Related papers (2020-05-08T14:04:18Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)