Word-level Human Interpretable Scoring Mechanism for Novel Text
Detection Using Tsetlin Machines
- URL: http://arxiv.org/abs/2105.04708v1
- Date: Mon, 10 May 2021 23:41:14 GMT
- Title: Word-level Human Interpretable Scoring Mechanism for Novel Text
Detection Using Tsetlin Machines
- Authors: Bimal Bhattarai, Ole-Christoffer Granmo, Lei Jiao
- Abstract summary: We propose a Tsetlin machine architecture for scoring individual words according to their contribution to novelty.
Our approach encodes a description of the novel documents using the linguistic patterns captured by TM clauses.
We then adopt this description to measure how much a word contributes to making documents novel.
- Score: 16.457778420360537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research in novelty detection focuses mainly on document-level
classification, employing deep neural networks (DNNs). However, the black-box
nature of DNNs makes it difficult to extract an exact explanation of why a
document is considered novel. In addition, dealing with novelty at the
word level is crucial to provide a more fine-grained analysis than what is
available at the document level. In this work, we propose a Tsetlin machine
(TM)-based architecture for scoring individual words according to their
contribution to novelty. Our approach encodes a description of the novel
documents using the linguistic patterns captured by TM clauses. We then adopt
this description to measure how much a word contributes to making documents
novel. Our experimental results demonstrate how our approach breaks down
novelty into interpretable phrases, successfully measuring novelty.
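
The clause-based scoring idea can be made concrete with a short sketch. The Python below is a minimal illustration rather than the paper's implementation: it assumes each clause is a weighted conjunction of word literals, and it scores a word by summing the weights of the matching clauses that contain it as a positive literal; the clause set, weights, and example sentence are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clause:
    positive: frozenset   # words that must be present in the document
    negative: frozenset   # words that must be absent from the document
    weight: float         # vote this clause contributes when it matches

def clause_matches(clause, words):
    """A conjunctive clause fires iff all of its literals are satisfied."""
    return clause.positive <= words and clause.negative.isdisjoint(words)

def word_novelty_scores(document, clauses):
    """Score each word by summing the weights of the matching clauses
    that include it as a positive literal."""
    words = set(document.lower().split())
    scores = {w: 0.0 for w in words}
    for clause in clauses:
        if clause_matches(clause, words):
            for w in clause.positive:
                scores[w] += clause.weight
    return scores

# Hypothetical clauses, standing in for patterns a trained TM has learned.
clauses = [
    Clause(frozenset({"quantum", "entanglement"}), frozenset({"classical"}), 1.5),
    Clause(frozenset({"entanglement"}), frozenset(), 0.5),
]
print(word_novelty_scores("quantum entanglement enables secure channels", clauses))
# {'quantum': 1.5, 'entanglement': 2.0, 'enables': 0.0, ...}
```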
Related papers
- Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context [30.461643690171258]
Lexical ambiguity is a challenging and pervasive problem in machine translation (MT).
We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural MT.
Our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines.
arXiv Detail & Related papers (2023-11-27T03:05:48Z)
- Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution [2.3429306644730854]
A range of word-based stylistic markers has been successfully used in deep learning methods to address the intrinsic problem of authorship attribution.
The proposed method was experimentally evaluated against numerous state-of-the-art methods on the public CCAT50, IMDb62, Blog50, and Twitter50 corpora.
arXiv Detail & Related papers (2023-06-26T11:35:47Z)
- Improving Long Context Document-Level Machine Translation [51.359400776242786]
Document-level context for neural machine translation (NMT) is crucial to improve translation consistency and cohesion.
Many works have been published on document-level NMT, but most restrict the system to local context only.
We propose a constrained attention variant that focuses the attention on the most relevant parts of the sequence while simultaneously reducing memory consumption.
arXiv Detail & Related papers (2023-06-08T13:28:48Z)
- On Search Strategies for Document-Level Neural Machine Translation [51.359400776242786]
Document-level neural machine translation (NMT) models produce a more consistent output across a document.
In this work, we aim to answer the question of how best to utilize a context-aware translation model in decoding.
arXiv Detail & Related papers (2023-06-08T11:30:43Z)
- HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use document-level context to improve translation quality.
Irrelevant or trivial words may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z)
- Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis [64.70116276295609]
SentiWSP is a sentiment-aware pre-trained language model that combines word-level and sentence-level pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z)
- Measuring the Novelty of Natural Language Text Using the Conjunctive Clauses of a Tsetlin Machine Text Classifier [12.087658145293522]
Most supervised text classification approaches assume a closed world, counting on all classes being present in the data at training time.
This assumption can lead to unpredictable behaviour during operation whenever novel, previously unseen classes appear.
We extend the recently introduced Tsetlin machine (TM) with a novelty scoring mechanism; see the clause-sum sketch after this list.
arXiv Detail & Related papers (2020-11-17T16:35:21Z)
- Legal Document Classification: An Application to Law Area Prediction of Petitions to Public Prosecution Service [6.696983725360808]
This paper proposes the use of NLP techniques for textual classification.
Our main goal is to automate the process of assigning petitions to their respective areas of law.
The best results were obtained with a combination of Word2Vec trained on a domain-specific corpus and a Recurrent Neural Network architecture.
arXiv Detail & Related papers (2020-10-13T18:05:37Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text describing another recordset.
The output is a summary that accurately describes the partial content of the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)