Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study
in Polish
- URL: http://arxiv.org/abs/2206.02208v1
- Date: Sun, 5 Jun 2022 15:48:16 GMT
- Title: Stylistic Fingerprints, POS-tags and Inflected Languages: A Case Study
in Polish
- Authors: Maciej Eder and Rafał L. Górski
- Abstract summary: Inflected languages produce countless word forms, making frequency counts sparse and most statistical procedures complicated.
This paper examines the usefulness of grammatical features (as assessed via POS-tag n-grams) and lemmatized forms in recognizing authorial profiles.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In stylometric investigations, frequencies of the most frequent words (MFWs)
and character n-grams outperform other style-markers, even if their performance
varies significantly across languages. In inflected languages, word endings
play a prominent role, and hence different word forms cannot be recognized
using generic text tokenization. Countless inflected word forms make
frequencies sparse, which complicates most statistical procedures. Presumably,
applying one of the NLP techniques, such as lemmatization and/or parsing, might
increase the performance of classification. The aim of this paper is to examine
the usefulness of grammatical features (as assessed via POS-tag n-grams) and
lemmatized forms in recognizing authorial profiles, in order to address the
underlying issue of the degree of freedom of choice within lexis and grammar.
Using a corpus of Polish novels, we performed a series of supervised authorship
attribution benchmarks, in order to compare the classification accuracy for
different types of lexical and syntactic style-markers. Even if the performance
of POS-tags and lemmatized forms was consistently worse than that of
lexical markers, the difference was not substantial and never exceeded ca. 15%.
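The abstract's core procedure, attributing a text to an author by comparing relative frequencies of the most frequent words (MFWs), can be illustrated with a minimal sketch. The toy corpus, vocabulary size, and plain Manhattan distance below are illustrative assumptions, not the paper's actual Polish corpus or its full Delta-style z-scoring; the same profile-and-compare scheme applies if the vocabulary is replaced with POS-tag n-grams or lemmas.

```python
from collections import Counter

# Toy corpus: (author, text) pairs stand in for the paper's Polish novels.
corpus = {
    "A": "the cat sat on the mat and the cat slept",
    "B": "a dog ran in a park and a dog barked",
}
test_text = "the cat ran on the mat"

def mfw_profile(text, vocab):
    """Relative frequencies of the chosen most-frequent-word vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

# Pick the globally most frequent words as style markers (MFWs).
all_counts = Counter(" ".join(corpus.values()).split())
vocab = [w for w, _ in all_counts.most_common(5)]

profiles = {a: mfw_profile(t, vocab) for a, t in corpus.items()}
test_profile = mfw_profile(test_text, vocab)

# Burrows's Delta uses mean absolute difference over z-scored frequencies;
# with a toy corpus we use plain Manhattan distance on raw frequencies.
def manhattan(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))

best = min(profiles, key=lambda a: manhattan(profiles[a], test_profile))
print(best)  # nearest author by MFW profile
```

To use POS-tag n-grams instead, one would tag the texts, replace `text.split()` with the sequence of tag n-grams, and keep the rest of the pipeline unchanged.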
Related papers
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- How word semantics and phonology affect handwriting of Alzheimer's patients: a machine learning based analysis [20.36565712578267]
We investigated how word semantics and phonology affect the handwriting of people affected by Alzheimer's disease.
We used the data from six handwriting tasks, each requiring copying a word belonging to one of the following categories.
The experimental results showed that the feature selection allowed us to derive a different set of highly distinctive features for each word type.
arXiv Detail & Related papers (2023-07-06T13:35:06Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages [18.210880703295253]
We finetune pretrained language models (PLMs) on seven languages from three different families.
We analyze their zero-shot performance on closely related, non-standardized varieties.
Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data is the strongest predictor for model performance on target data.
arXiv Detail & Related papers (2023-04-20T08:32:34Z)
- CCPrefix: Counterfactual Contrastive Prefix-Tuning for Many-Class Classification [57.62886091828512]
We propose a brand-new prefix-tuning method, Counterfactual Contrastive Prefix-tuning (CCPrefix) for many-class classification.
Basically, an instance-dependent soft prefix, derived from fact-counterfactual pairs in the label space, is leveraged to complement the language verbalizers in many-class classification.
arXiv Detail & Related papers (2022-11-11T03:45:59Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT [7.057643880514415]
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment is manifested across the embedding spaces of different languages.
arXiv Detail & Related papers (2021-01-26T19:21:59Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing? [22.93722845643562]
We show that POS tagging can still significantly improve parsing performance when using the Stack joint framework.
Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS tag data.
arXiv Detail & Related papers (2020-03-06T13:47:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.