Word-level Human Interpretable Scoring Mechanism for Novel Text
Detection Using Tsetlin Machines
- URL: http://arxiv.org/abs/2105.04708v1
- Date: Mon, 10 May 2021 23:41:14 GMT
- Title: Word-level Human Interpretable Scoring Mechanism for Novel Text
Detection Using Tsetlin Machines
- Authors: Bimal Bhattarai, Ole-Christoffer Granmo, Lei Jiao
- Abstract summary: We propose a Tsetlin machine architecture for scoring individual words according to their contribution to novelty.
Our approach encodes a description of the novel documents using the linguistic patterns captured by TM clauses.
We then adopt this description to measure how much a word contributes to making documents novel.
- Score: 16.457778420360537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research in novelty detection focuses mainly on document-level
classification, employing deep neural networks (DNNs). However, the black-box
nature of DNNs makes it difficult to extract an exact explanation of why a
document is considered novel. In addition, dealing with novelty at the
word level is crucial to provide a more fine-grained analysis than what is
available at the document level. In this work, we propose a Tsetlin machine
(TM)-based architecture for scoring individual words according to their
contribution to novelty. Our approach encodes a description of the novel
documents using the linguistic patterns captured by TM clauses. We then adopt
this description to measure how much a word contributes to making documents
novel. Our experimental results demonstrate how our approach breaks down
novelty into interpretable phrases, successfully measuring novelty.
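
The clause-based scoring idea can be made concrete with a short sketch. The Python below is a minimal illustration rather than the paper's implementation: it assumes each clause is a weighted conjunction of word literals, and it scores a word by summing the weights of the matching clauses that contain it as a positive literal; the clause set, weights, and example sentence are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clause:
    positive: frozenset   # words that must be present in the document
    negative: frozenset   # words that must be absent from the document
    weight: float         # vote this clause contributes when it matches

def clause_matches(clause, words):
    """A conjunctive clause fires iff all of its literals are satisfied."""
    return clause.positive <= words and clause.negative.isdisjoint(words)

def word_novelty_scores(document, clauses):
    """Score each word by summing the weights of the matching clauses
    that include it as a positive literal."""
    words = set(document.lower().split())
    scores = {w: 0.0 for w in words}
    for clause in clauses:
        if clause_matches(clause, words):
            for w in clause.positive:
                scores[w] += clause.weight
    return scores

# Hypothetical clauses, standing in for patterns a trained TM has learned.
clauses = [
    Clause(frozenset({"quantum", "entanglement"}), frozenset({"classical"}), 1.5),
    Clause(frozenset({"entanglement"}), frozenset(), 0.5),
]
print(word_novelty_scores("quantum entanglement enables secure channels", clauses))
# {'quantum': 1.5, 'entanglement': 2.0, 'enables': 0.0, ...}
```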
Related papers
- Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context [30.461643690171258]
Lexical ambiguity is a challenging and pervasive problem in machine translation (MT).
We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural MT.
Our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines.
arXiv Detail & Related papers (2023-11-27T03:05:48Z)
- Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution [2.3429306644730854]
A range of word-based stylistic markers has been successfully used in deep learning methods to address the intrinsic problem of authorship attribution.
The proposed method was experimentally evaluated against numerous state-of-the-art methods on the public CCAT50, IMDb62, Blog50, and Twitter50 corpora.
arXiv Detail & Related papers (2023-06-26T11:35:47Z)
- Improving Long Context Document-Level Machine Translation [51.359400776242786]
Document-level context for neural machine translation (NMT) is crucial to improve translation consistency and cohesion.
Many works have been published on document-level NMT, but most restrict the system to local context only.
We propose a constrained attention variant that focuses the attention on the most relevant parts of the sequence while simultaneously reducing memory consumption.
arXiv Detail & Related papers (2023-06-08T13:28:48Z)
- On Search Strategies for Document-Level Neural Machine Translation [51.359400776242786]
Document-level neural machine translation (NMT) models produce a more consistent output across a document.
In this work, we aim to answer the question of how best to utilize a context-aware translation model in decoding.
arXiv Detail & Related papers (2023-06-08T11:30:43Z)
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use document-level context to improve translation quality.
Irrelevant or trivial words may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis [64.70116276295609]
SentiWSP is a sentiment-aware pre-trained language model that combines word-level and sentence-level pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z)
- Measuring the Novelty of Natural Language Text Using the Conjunctive Clauses of a Tsetlin Machine Text Classifier [12.087658145293522]
Most supervised text classification approaches assume a closed world, counting on all classes being present in the data at training time.
This assumption can lead to unpredictable behaviour during operation whenever novel, previously unseen classes appear.
We extend the recently introduced Tsetlin machine (TM) with a novelty scoring mechanism; see the clause-sum sketch after this list.
arXiv Detail & Related papers (2020-11-17T16:35:21Z)
- Legal Document Classification: An Application to Law Area Prediction of Petitions to Public Prosecution Service [6.696983725360808]
This paper proposes the use of NLP techniques for textual classification.
Our main goal is to automate the process of assigning petitions to their respective areas of law.
The best results were obtained with a combination of Word2Vec trained on a domain-specific corpus and a Recurrent Neural Network architecture.
arXiv Detail & Related papers (2020-10-13T18:05:37Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text describing another recordset.
The output is a summary that accurately describes the partial content of the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)