Authorship Verification based on the Likelihood Ratio of Grammar Models
- URL: http://arxiv.org/abs/2403.08462v1
- Date: Wed, 13 Mar 2024 12:25:47 GMT
- Title: Authorship Verification based on the Likelihood Ratio of Grammar Models
- Authors: Andrea Nini, Oren Halvani, Lukas Graner, Valerio Gherardi, Shunichi
Ishihara
- Abstract summary: Authorship Verification (AV) is the process of analyzing a set of documents to determine whether they were written by a specific author.
We propose a method relying on calculating a quantity we call $\lambda_G$ (LambdaG).
Despite not needing large amounts of data for training, LambdaG still outperforms other established AV methods with higher computational complexity.
- Score: 0.8749675983608172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Authorship Verification (AV) is the process of analyzing a set of documents
to determine whether they were written by a specific author. This problem often
arises in forensic scenarios, e.g., in cases where the documents in question
constitute evidence for a crime. Existing state-of-the-art AV methods use
computational solutions that are not supported by a plausible scientific
explanation for their functioning and that are often difficult for analysts to
interpret. To address this, we propose a method relying on calculating a
quantity we call $\lambda_G$ (LambdaG): the ratio between the likelihood of a
document given a model of the Grammar for the candidate author and the
likelihood of the same document given a model of the Grammar for a reference
population. These Grammar Models are estimated using $n$-gram language models
that are trained solely on grammatical features. Despite not needing large
amounts of data for training, LambdaG still outperforms other established AV
methods with higher computational complexity, including a fine-tuned Siamese
Transformer network. Our empirical evaluation based on four baseline methods
applied to twelve datasets shows that LambdaG leads to better results in terms
of both accuracy and AUC in eleven cases, and in all twelve cases when considering
only topic-agnostic methods. The algorithm is also highly robust to important
variations in the genre of the reference population in many cross-genre
comparisons. In addition to these properties, we demonstrate how LambdaG is
easier to interpret than the current state-of-the-art. We argue that the
advantage of LambdaG over other methods is due to the fact that it is compatible
with Cognitive Linguistic theories of language processing.
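In the abstract's terms, $\lambda_G$ is the ratio between the likelihood of the questioned document under a grammar model of the candidate author and its likelihood under a grammar model of a reference population, with both grammar models estimated by $n$-gram language models trained solely on grammatical features. The abstract does not give the feature extraction, $n$-gram order, or smoothing details, so the following is only a minimal illustrative sketch under stated assumptions: it approximates the grammar models with add-one-smoothed bigram models over caller-supplied POS-like tag sequences and reports the ratio in log space. The function names, tag inventory, and smoothing choice are assumptions for illustration, not the paper's implementation.

```python
# Illustrative LambdaG-style likelihood ratio, based only on the abstract.
# Assumptions (not from the paper): grammatical features are approximated by
# POS-like tags supplied by the caller, each grammar model is a bigram model
# with add-one smoothing, and the score is returned in log space.
import math
from collections import Counter

def train_bigram_model(sequences):
    """Estimate bigram and context counts over tag sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    vocab = {tok for seq in sequences for tok in seq} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def log_likelihood(seq, model):
    """Add-one-smoothed log-likelihood of one tag sequence under a bigram model."""
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def lambda_g(doc_seqs, author_model, reference_model):
    """Log-likelihood ratio: positive values favour the candidate author."""
    ll_author = sum(log_likelihood(s, author_model) for s in doc_seqs)
    ll_reference = sum(log_likelihood(s, reference_model) for s in doc_seqs)
    return ll_author - ll_reference

# Toy usage with hypothetical POS-tag sequences:
author_model = train_bigram_model([["DET", "NOUN", "VERB"], ["PRON", "VERB", "ADV"]])
reference_model = train_bigram_model([["NOUN", "VERB", "NOUN"], ["DET", "ADJ", "NOUN"]])
print(lambda_g([["DET", "NOUN", "VERB"]], author_model, reference_model))
```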
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z) - Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Detecting and explaining (in)equivalence of context-free grammars [0.6282171844772422]
We propose a scalable framework for deciding, proving, and explaining (in)equivalence of context-free grammars.
We present an implementation of the framework and evaluate it on large data sets collected within educational support systems.
arXiv Detail & Related papers (2024-07-25T17:36:18Z) - Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information should help model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z) - Efficient and Flexible Topic Modeling using Pretrained Embeddings and
Bag of Sentences [1.8592384822257952]
We propose a novel topic modeling and inference algorithm.
We leverage pre-trained sentence embeddings by combining generative process models and clustering.
The evaluation shows that our method yields state-of-the-art results with relatively low computational demands.
arXiv Detail & Related papers (2023-02-06T20:13:11Z) - A Syntax-Guided Grammatical Error Correction Model with Dependency Tree
Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z) - Studying word order through iterative shuffling [14.530986799844873]
We show that word order encodes meaning essential to performing NLP benchmark tasks.
We use IBIS, a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model.
We discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.
arXiv Detail & Related papers (2021-09-10T13:27:06Z) - Translation of Lambek Categorial Grammars into Abstract Categorial Grammars [0.0]
This internship report demonstrates that every Lambek Grammar can be, not entirely but efficiently, expressed in Abstract Categorial Grammars (ACG).
The main idea is to transform the type rewriting system of LGs into that of Context-Free Grammars (CFG) by erasing introduction and elimination rules and generating enough axioms so that the cut rule suffices.
Although the underlying algorithm was not fully implemented, this proof provides another argument in favour of the relevance of ACGs in Natural Language Processing.
arXiv Detail & Related papers (2020-01-23T18:23:03Z)