Perturbations and Subpopulations for Testing Robustness in Token-Based
Argument Unit Recognition
- URL: http://arxiv.org/abs/2209.14780v1
- Date: Thu, 29 Sep 2022 13:44:28 GMT
- Title: Perturbations and Subpopulations for Testing Robustness in Token-Based
Argument Unit Recognition
- Authors: Jonathan Kamp, Lisa Beinborn, Antske Fokkens
- Abstract summary: Argument Unit Recognition and Classification aims at identifying argument units from text and classifying them as pro or against.
One of the design choices that need to be made when developing systems for this task is what the unit of classification should be: segments of tokens or full sentences.
Previous research suggests that fine-tuning language models on the token-level yields more robust results for classifying sentences compared to training on sentences directly.
We reproduce the study that originally made this claim and further investigate what exactly token-based systems learned better compared to sentence-based ones.
- Score: 6.502694770864571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Argument Unit Recognition and Classification aims at identifying argument
units from text and classifying them as pro or against. One of the design
choices that need to be made when developing systems for this task is what the
unit of classification should be: segments of tokens or full sentences.
Previous research suggests that fine-tuning language models on the token-level
yields more robust results for classifying sentences compared to training on
sentences directly. We reproduce the study that originally made this claim and
further investigate what exactly token-based systems learned better compared to
sentence-based ones. We develop systematic tests for analysing the behavioural
differences between the token-based and the sentence-based system. Our results
show that token-based models are generally more robust than sentence-based
models both on manually perturbed examples and on specific subpopulations of
the data.
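As a concrete illustration of the two units of classification compared above, the following is a minimal sketch of a token-based and a sentence-based setup using the Hugging Face transformers library; the encoder name, the three-way label set, and the example sentence are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the two classification granularities discussed above.
# "bert-base-uncased", the 3-way label set and the example sentence are
# illustrative assumptions, not taken from the paper.
import torch
from transformers import (AutoTokenizer,
                          AutoModelForTokenClassification,
                          AutoModelForSequenceClassification)

sentence = "Social media increases political polarisation among young voters."
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tok(sentence, return_tensors="pt")

# Token-based system: every token receives its own label; a sentence-level
# decision can then be derived from the token predictions.
token_model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
with torch.no_grad():
    token_preds = token_model(**inputs).logits.argmax(dim=-1)  # shape (1, seq_len)

# Sentence-based system: the whole sentence is classified in one step.
sent_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
with torch.no_grad():
    sent_pred = sent_model(**inputs).logits.argmax(dim=-1)     # shape (1,)
```

The behavioural tests reported in the abstract compare how these two setups react to manually perturbed inputs and to specific subpopulations of the data.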
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
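As an illustration of the prompt-based setup this summary refers to, the sketch below shows a zero-shot classification prompt; the instruction wording, labels, and example sentence are assumptions, not material from the paper.

```python
# Hypothetical zero-shot classification prompt; wording, labels and example
# sentence are illustrative, not taken from the paper.
prompt = (
    "Decide whether the following sentence argues for or against the topic "
    "'nuclear energy'. Answer with 'pro', 'con' or 'neither'.\n"
    "Sentence: Nuclear plants provide stable, low-carbon baseload power.\n"
    "Answer:"
)
# The prompt is sent to a foundational language model (hosted locally or via an
# API) and the model's completion is read off as the predicted label, without
# any task-specific fine-tuning.
```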
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Comparison Study Between Token Classification and Sequence
Classification In Text Classification [0.45687771576879593]
Unsupervised machine learning techniques have been applied to Natural Language Processing tasks and surpass benchmarks such as GLUE with great success.
Language models built in this way achieve good results in one language and can be applied out of the box to multiple NLP tasks such as classification, summarization, and generation.
arXiv Detail & Related papers (2022-11-25T05:14:58Z) - Language Model Classifier Aligns Better with Physician Word Sensitivity
than XGBoost on Readmission Prediction [86.15787587540132]
We introduce the sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
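The comparison via rank correlations can be sketched as follows; the word list and score values are made-up placeholders, and Spearman's rho is used as one common rank-correlation choice (the summary does not specify the paper's exact statistic).

```python
# Hypothetical sketch: compare a clinician's word-importance ranking with a
# model's sensitivity scores over a shared vocabulary using a rank correlation.
# The words and scores below are made-up placeholders.
from scipy.stats import spearmanr

vocab = ["dialysis", "insulin", "discharge", "fever", "the"]
clinician_scores = [0.92, 0.85, 0.60, 0.40, 0.02]  # elicited word importance
model_scores     = [0.88, 0.70, 0.65, 0.35, 0.10]  # per-word model sensitivity

rho, p_value = spearmanr(clinician_scores, model_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```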
arXiv Detail & Related papers (2022-11-13T23:59:11Z) - Discriminative Language Model as Semantic Consistency Scorer for
Prompt-based Few-Shot Text Classification [10.685862129925727]
This paper proposes a novel prompt-based fine-tuning method (called DLM-SCS) for few-shot text classification.
The underlying idea is that the prompt instantiated with the true label should have a higher semantic consistency score than prompts instantiated with false labels.
Our model outperforms several state-of-the-art prompt-based few-shot methods.
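The selection rule implied by this summary, namely instantiating the prompt with each candidate label and picking the label whose prompt scores highest, can be sketched as below. The paper's discriminative scorer is not reproduced here; as a stand-in, this sketch scores each filled prompt by the probability a masked language model assigns to the label word, and the template and label words are assumptions.

```python
# Sketch of the label-selection rule: fill the prompt with each candidate label
# and keep the label with the highest score. The masked-LM probability used
# here is a stand-in for the paper's semantic consistency scorer; the template
# and label words are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def label_score(text: str, label: str) -> float:
    # Probability the masked LM assigns to the label word in the prompt slot.
    prompt = f"{text} Overall, it was {tok.mask_token}."
    inputs = tok(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits
    label_id = tok.convert_tokens_to_ids(label)
    return logits[0, mask_pos].softmax(dim=-1)[label_id].item()

text = "The plot was predictable and the acting was flat."
predicted = max(["great", "terrible"], key=lambda lab: label_score(text, lab))
```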
arXiv Detail & Related papers (2022-10-23T16:10:48Z) - Reweighting Strategy based on Synthetic Data Identification for Sentence
Similarity [30.647497555295974]
We train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences.
The distilled information from the classifier is then used to train a reliable sentence embedding model.
Our model trained on synthetic data generalizes well and outperforms the existing baselines.
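A hedged sketch of this two-stage idea, training a detector of machine-written sentences and reusing its confidence to weight training examples for the embedding model, is given below; the toy data, TF-IDF features, and logistic-regression detector are placeholders, not the paper's actual setup.

```python
# Hypothetical sketch of the two-stage reweighting idea. The toy sentences,
# TF-IDF features and logistic-regression detector are placeholders; the
# paper's classifier and weighting scheme are not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

sentences = [
    "The cat sat quietly by the window.",                  # human-written
    "The weather is nice and the weather is nice.",        # machine-written
    "She finished the report well before lunch.",          # human-written
    "This product is a product that is a good product.",   # machine-written
]
is_machine = [0, 1, 0, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(sentences)
detector = LogisticRegression().fit(X, is_machine)

# Use the detector's confidence that a sentence is human-like as a training
# weight for the downstream sentence-embedding model.
weights = detector.predict_proba(X)[:, 0]  # P(human-written)
```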
arXiv Detail & Related papers (2022-08-29T05:42:22Z) - Resolving label uncertainty with implicit posterior models [71.62113762278963]
We propose a method for jointly inferring labels across a collection of data samples.
By implicitly assuming the existence of a generative model for which a differentiable predictor is the posterior, we derive a training objective that allows learning under weak beliefs.
arXiv Detail & Related papers (2022-02-28T18:09:44Z) - More Than Words: Towards Better Quality Interpretations of Text
Classifiers [16.66535643383862]
We show that token-based interpretability, while being a convenient first choice given the input interfaces of the ML models, is not the most effective one in all situations.
We show that higher-level feature attributions offer several advantages: 1) they are more robust as measured by the randomization tests, 2) they lead to lower variability when using approximation-based methods like SHAP, and 3) they are more intelligible to humans in situations where the linguistic coherence resides at a higher level.
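One way to see the contrast between token-level and higher-level attributions is to aggregate token scores into phrase scores, as in the small sketch below; the attribution values are made-up placeholders and the simple summation is an illustration, not the paper's exact procedure.

```python
# Made-up token attributions for one input; summing them over phrase spans is
# one simple way to obtain higher-level feature attributions (an illustration,
# not the paper's procedure).
tokens = ["the", "battery", "died", "after", "two", "days"]
token_attr = [0.01, 0.35, 0.40, 0.02, 0.05, 0.08]

phrases = {"the battery": (0, 2), "died after two days": (2, 6)}
phrase_attr = {p: sum(token_attr[i:j]) for p, (i, j) in phrases.items()}
print(phrase_attr)  # {'the battery': 0.36, 'died after two days': 0.55}
```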
arXiv Detail & Related papers (2021-12-23T10:18:50Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect samples causing oversensitivity and overstability with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - "Sharks are not the threat humans are": Argument Component Segmentation
in School Student Essays [3.632177840361928]
We apply a token-level classification to identify claim and premise tokens from a new corpus of argumentative essays written by middle school students.
We demonstrate that a BERT-based multi-task learning architecture (i.e., token and sentence level classification) adaptively pretrained on a relevant unlabeled dataset obtains the best results.
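For illustration, token-level labels for claims and premises are often written as BIO tags, as in the small example below; the sentence and tag names are assumptions, not drawn from the paper's student-essay corpus.

```python
# Hypothetical BIO-style token labels for argument component segmentation;
# the sentence and tag names are illustrative, not from the paper's corpus.
tokens = ["Sharks", "rarely", "attack", "humans", ",", "so", "beaches",
          "should", "stay", "open", "."]
labels = ["B-PREMISE", "I-PREMISE", "I-PREMISE", "I-PREMISE", "O", "O",
          "B-CLAIM", "I-CLAIM", "I-CLAIM", "I-CLAIM", "O"]
for token, label in zip(tokens, labels):
    print(f"{token:>8}  {label}")
```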
arXiv Detail & Related papers (2021-03-08T02:40:07Z) - Narrative Incoherence Detection [76.43894977558811]
We propose the task of narrative incoherence detection as a new arena for inter-sentential semantic understanding.
Given a multi-sentence narrative, the task is to decide whether there are any semantic discrepancies in the narrative flow.
arXiv Detail & Related papers (2020-12-21T07:18:08Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)