Authorship Verification based on the Likelihood Ratio of Grammar Models
- URL: http://arxiv.org/abs/2403.08462v1
- Date: Wed, 13 Mar 2024 12:25:47 GMT
- Title: Authorship Verification based on the Likelihood Ratio of Grammar Models
- Authors: Andrea Nini, Oren Halvani, Lukas Graner, Valerio Gherardi, Shunichi
Ishihara
- Abstract summary: Authorship Verification (AV) is the process of analyzing a set of documents to determine whether they were written by a specific author.
We propose a method relying on calculating a quantity we call $\lambda_G$ (LambdaG).
Despite not needing large amounts of data for training, LambdaG still outperforms other established AV methods with higher computational complexity.
- Score: 0.8749675983608172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Authorship Verification (AV) is the process of analyzing a set of documents
to determine whether they were written by a specific author. This problem often
arises in forensic scenarios, e.g., in cases where the documents in question
constitute evidence for a crime. Existing state-of-the-art AV methods use
computational solutions that are not supported by a plausible scientific
explanation for their functioning and that are often difficult for analysts to
interpret. To address this, we propose a method relying on calculating a
quantity we call $\lambda_G$ (LambdaG): the ratio between the likelihood of a
document given a model of the Grammar for the candidate author and the
likelihood of the same document given a model of the Grammar for a reference
population. These Grammar Models are estimated using $n$-gram language models
that are trained solely on grammatical features. Despite not needing large
amounts of data for training, LambdaG still outperforms other established AV
methods with higher computational complexity, including a fine-tuned Siamese
Transformer network. Our empirical evaluation based on four baseline methods
applied to twelve datasets shows that LambdaG leads to better results in terms
of both accuracy and AUC in eleven cases, and in all twelve cases when considering
only topic-agnostic methods. The algorithm is also highly robust to important
variations in the genre of the reference population in many cross-genre
comparisons. In addition to these properties, we demonstrate how LambdaG is
easier to interpret than the current state-of-the-art. We argue that the
advantage of LambdaG over other methods is due to the fact that it is compatible
with Cognitive Linguistic theories of language processing.
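To make the quantity concrete: one plausible reading of the abstract, under the assumption that the ratio is taken in log space, is
$$ \lambda_G(d) = \log \frac{P(d \mid G_{\text{author}})}{P(d \mid G_{\text{reference}})}, $$
where each likelihood comes from an $n$-gram model trained only on grammatical features of the respective corpus. The Python sketch below illustrates this kind of computation; the use of POS tags as the grammatical features, trigrams, Laplace smoothing, and all names in the code are illustrative assumptions rather than the authors' actual implementation.

```python
# Minimal sketch of a LambdaG-style likelihood ratio: two n-gram models over
# grammatical symbols (here, hand-written POS tags), one for the candidate
# author and one for a reference population, and the log-likelihood ratio of
# a questioned document under them.  Illustrative only, not the paper's code.

from collections import Counter
from math import log


def ngrams(tokens, n):
    """Yield n-grams (tuples) from a token sequence padded with boundary markers."""
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])


class NgramGrammarModel:
    """n-gram model over grammatical symbols with Laplace (add-one) smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def fit(self, sequences):
        for seq in sequences:
            self.vocab.update(seq)
            for gram in ngrams(seq, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
        return self

    def log_prob(self, sequence):
        """Smoothed log-probability of one sequence of grammatical symbols."""
        v = len(self.vocab) + 2  # +2 for the <s> and </s> boundary markers
        total = 0.0
        for gram in ngrams(sequence, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + v
            total += log(numerator / denominator)
        return total


def lambda_g(document, author_model, reference_model):
    """Log-likelihood ratio: positive values favour the candidate author."""
    return author_model.log_prob(document) - reference_model.log_prob(document)


# Toy usage with hand-written POS-tag sequences; a real pipeline would obtain
# these grammatical features from a tagger or parser applied to the documents.
author_docs = [["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
               ["PRON", "VERB", "DET", "NOUN", "ADP", "NOUN"]]
reference_docs = [["NOUN", "VERB", "ADV"],
                  ["DET", "ADJ", "ADJ", "NOUN", "VERB", "NOUN"]]
questioned = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]

author_model = NgramGrammarModel(n=3).fit(author_docs)
reference_model = NgramGrammarModel(n=3).fit(reference_docs)
print(lambda_g(questioned, author_model, reference_model))
```

In this toy version, a positive score means the questioned document is better explained by the candidate author's grammar model than by the reference population's; the paper's actual feature extraction, smoothing, and calibration choices may differ.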
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves retrieval performance competitive with state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Detecting and explaining (in)equivalence of context-free grammars [0.6282171844772422]
We propose a scalable framework for deciding, proving, and explaining (in)equivalence of context-free grammars.
We present an implementation of the framework and evaluate it on large data sets collected within educational support systems.
arXiv Detail & Related papers (2024-07-25T17:36:18Z) - Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information should help model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - Efficient and Flexible Topic Modeling using Pretrained Embeddings and
Bag of Sentences [1.8592384822257952]
We propose a novel topic modeling and inference algorithm.
We leverage pre-trained sentence embeddings by combining generative process models and clustering.
Our evaluation shows that the method yields state-of-the-art results with relatively low computational demands.
arXiv Detail & Related papers (2023-02-06T20:13:11Z) - A Syntax-Guided Grammatical Error Correction Model with Dependency Tree
Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z) - Studying word order through iterative shuffling [14.530986799844873]
We show that word order encodes meaning essential to performing NLP benchmark tasks.
We use IBIS, a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model.
We discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.
arXiv Detail & Related papers (2021-09-10T13:27:06Z) - Traduction des Grammaires Cat\'egorielles de Lambek dans les Grammaires
Cat\'egorielles Abstraites [0.0]
This internship report demonstrates that every Lambek Grammar can be, not entirely but efficiently, expressed in Abstract Categorial Grammars (ACGs).
The main idea is to transform the type rewriting system of LGs into that of Context-Free Grammars (CFG) by erasing introduction and elimination rules and generating enough axioms so that the cut rule suffices.
Although the underlying algorithm was not fully implemented, this proof provides another argument in favour of the relevance of ACGs in Natural Language Processing.
arXiv Detail & Related papers (2020-01-23T18:23:03Z)