Detection of tortured phrases in scientific literature
- URL: http://arxiv.org/abs/2402.03370v1
- Date: Fri, 2 Feb 2024 08:15:43 GMT
- Title: Detection of tortured phrases in scientific literature
- Authors: Eléna Martel (SIGMA, LIG), Martin Lentschat (SIGMA, GETALP), Cyril
Labbé (LIG, SIGMA)
- Abstract summary: This paper presents various automatic detection methods to extract so-called tortured phrases from scientific papers.
With a recall of .87 and a precision of .61, it can retrieve new tortured phrases to be submitted to domain experts for validation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents various automatic detection methods to extract
so-called tortured phrases from scientific papers. These tortured phrases,
e.g. "flag to clamor" instead of "signal to noise", are the result of
paraphrasing tools used to escape plagiarism detection. We built a dataset and
evaluated several strategies to flag previously undocumented tortured phrases.
The proposed and tested methods are based on language models and rely either
on embedding similarities or on masked-token predictions. We found that an
approach using token prediction, propagating the scores to the chunk level,
gives the best results. With a recall of .87 and a precision of .61, it can
retrieve new tortured phrases to be submitted to domain experts for validation.
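The best-performing approach in the abstract (masked-token prediction with scores propagated to the chunk level) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `token_prob` is a hypothetical stand-in for a real masked language model such as BERT.

```python
def chunk_score(tokens, token_prob):
    """Average masked-token probability of a phrase (chunk).

    token_prob(tokens, i) is assumed to return the probability a masked
    language model assigns to the actual token at position i when that
    position is masked; a real implementation would wrap a model such
    as BERT.
    """
    # Mask each token in turn and record the model's probability for it,
    # then propagate the token-level scores to the chunk level by averaging.
    probs = [token_prob(tokens, i) for i in range(len(tokens))]
    return sum(probs) / len(probs)

# Toy scorer standing in for a masked language model: it treats the
# established expression as expected and everything else as surprising.
EXPECTED = {"signal", "to", "noise"}
def toy_prob(tokens, i):
    return 0.9 if tokens[i] in EXPECTED else 0.05

# A low chunk score flags a candidate tortured phrase for expert review.
print(chunk_score(["signal", "to", "noise"], toy_prob))
print(chunk_score(["flag", "to", "clamor"], toy_prob))
```

A real scorer would also need a threshold tuned on known tortured phrases to decide which low-scoring chunks to forward to domain experts.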
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and can even incur heavy degradation.
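Vocabulary trimming as described can be sketched as dropping rare subwords and re-segmenting their occurrences into kept pieces. The greedy longest-match segmentation and the tiny vocabulary below are hypothetical illustrations, not the paper's actual procedure:

```python
def segment(word, vocab):
    """Greedy longest-match segmentation of a word into subwords from vocab."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

def trim(vocab, counts, min_count):
    """Drop rare multi-character subwords; single characters are always kept."""
    return {s for s in vocab if counts.get(s, 0) >= min_count or len(s) == 1}

vocab = {"sig", "nal", "signal", "s", "i", "g", "n", "a", "l"}
counts = {"signal": 1, "sig": 50, "nal": 40}
kept = trim(vocab, counts, min_count=5)
print(segment("signal", kept))  # ['sig', 'nal']
```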
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Effects of term weighting approach with and without stop words removing on Arabic text classification [0.9217021281095907]
This study compares the effects of binary and term-frequency feature weighting methodologies on text classification when stop words are eliminated.
For all metrics, the term-frequency feature weighting approach with stop word removal outperforms the binary approach.
It is clear from the data that, for the same term weighting approach, stop word removal increases classification accuracy.
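The binary vs. term-frequency weighting comparison, with optional stop word removal, can be sketched like this. The stop word list and tokenization are hypothetical simplifications, not the study's actual pipeline:

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "in"}  # illustrative subset

def tf_features(text, remove_stop_words=True):
    """Term-frequency feature weights for one document."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

def binary_features(text, remove_stop_words=True):
    """Binary weighting: a term is either present (1) or absent."""
    return {t: 1 for t in tf_features(text, remove_stop_words)}

doc = "the signal of the noise is the signal"
print(tf_features(doc))      # Counter({'signal': 2, 'noise': 1})
print(binary_features(doc))  # {'signal': 1, 'noise': 1}
```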
arXiv Detail & Related papers (2024-02-21T11:31:04Z)
- Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition [49.42732949233184]
When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition.
Taking noisy pseudo-labels as ground truth in the loss function results in suboptimal performance.
We propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels.
arXiv Detail & Related papers (2023-08-12T12:13:52Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- A Semantic Approach to Negation Detection and Word Disambiguation with Natural Language Processing [1.0499611180329804]
This study demonstrates methods for detecting negation in a sentence by evaluating the lexical structure of the text.
The proposed method examines the unique features of the related expressions within a text to resolve their contextual usage.
arXiv Detail & Related papers (2023-02-05T03:58:45Z)
- Investigating the detection of Tortured Phrases in Scientific Literature [0.0]
A recent study introduced the concept of 'tortured phrases': unexpected, odd phrases that appear in place of established expressions.
The present study investigates how tortured phrases that are not yet listed can be detected automatically.
arXiv Detail & Related papers (2022-10-24T08:15:22Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set-encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Randomized Substitution and Vote for Textual Adversarial Example Detection [6.664295299367366]
A line of work has shown that natural language processing models are vulnerable to adversarial examples.
We propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V).
Empirical evaluations on three benchmark datasets demonstrate that RS&V detects textual adversarial examples more successfully than existing detection methods.
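From the abstract alone, the RS&V idea can be sketched as follows; the classifier, the synonym table, and the parameter values are all hypothetical stand-ins, not the paper's configuration:

```python
import random
from collections import Counter

def rs_and_vote(tokens, classify, synonyms, n_votes=5, sub_rate=0.3, seed=0):
    """Randomized Substitution and Vote, sketched from the abstract:
    randomly swap words for synonyms several times, classify each
    variant, and return the majority label. Substitution tends to undo
    adversarial word replacements, so a vote that disagrees with the
    label of the unmodified input suggests an adversarial example.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(n_votes):
        variant = [
            rng.choice(synonyms.get(t, [t])) if rng.random() < sub_rate else t
            for t in tokens
        ]
        votes.append(classify(variant))
    return Counter(votes).most_common(1)[0][0]

# Toy classifier and synonym table (hypothetical): the attacker swapped
# "noise" for "clamor" to dodge the keyword; substitution restores it.
clf = lambda toks: "flagged" if "clamor" in toks else "clean"
print(rs_and_vote(["flag", "to", "clamor"], clf, {"clamor": ["noise"]}, sub_rate=1.0))
# "clean", while clf on the raw input says "flagged"
```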
arXiv Detail & Related papers (2021-09-13T04:17:58Z)
- Using BERT Encoding to Tackle the Mad-lib Attack in SMS Spam Detection [0.0]
We investigate whether language models sensitive to the semantics and context of words, such as Google's BERT, may be useful to overcome this adversarial attack.
Using a dataset of 5572 SMS spam messages, we first established a baseline of detection performance.
Then, we built a thesaurus of the vocabulary contained in these messages, and set up a Mad-lib attack experiment.
We found that the classic models achieved a 94% Balanced Accuracy (BA) in the original dataset, whereas the BERT model obtained 96%.
arXiv Detail & Related papers (2021-07-13T21:17:57Z)
- MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and make low-confidence predictions without enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
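The keyword-masking step that MASKER's regularizers build on can be sketched like this; only the masking itself is shown, and the keyword set is supplied directly rather than derived as in the actual method:

```python
def mask_keywords(tokens, keywords, mask_token="[MASK]"):
    """Replace classifier-reliant keywords with a mask token so the model
    must rely on the surrounding context instead of a few shortcut words.
    In the actual method the keyword set would come from model statistics;
    here it is assumed given.
    """
    return [mask_token if t in keywords else t for t in tokens]

print(mask_keywords(["free", "prize", "call", "now"], {"free", "prize"}))
# ['[MASK]', '[MASK]', 'call', 'now']
```

Training on such masked inputs is what pushes the model toward context-based prediction rather than keyword reliance.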
arXiv Detail & Related papers (2020-12-17T04:54:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.