Related papers: Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming

URL: http://arxiv.org/abs/2304.03145v2
Date: Tue, 16 Apr 2024 18:04:14 GMT
Title: Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming
Authors: Clemencia Siro, Tunde Oluwaseyi Ajayi
Abstract summary: We explore the robustness of MRC models to entity renaming. We rename entities of type: country, person, nationality, location, organization, and city. We find that, compared to base models, large models perform comparatively well on novel entities.
Score: 3.117224133280308
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Question answering (QA) models have shown compelling results in the task of Machine Reading Comprehension (MRC). Recently these systems have been shown to perform better than humans on held-out test sets of datasets such as SQuAD, but their robustness is not guaranteed. The QA model's brittleness is exposed by a performance drop when it is evaluated on adversarially generated examples. In this study, we explore the robustness of MRC models to entity renaming, with entities from low-resource regions such as Africa. We propose EntSwap, a method for test-time perturbations, to create a test set whose entities have been renamed. In particular, we rename entities of type: country, person, nationality, location, organization, and city, to create AfriSQuAD2. Using the perturbed test set, we evaluate the robustness of three popular MRC models. We find that, compared to base models, large models perform comparatively well on novel entities. Furthermore, our analysis indicates that the entity type person is highly challenging for the MRC models.
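The abstract describes test-time entity renaming only at a high level. The following is a minimal sketch of that general idea on a SQuAD-style example, not the authors' EntSwap implementation. The function name `rename_entities`, the example record, and the replacement names are all hypothetical, and the sketch assumes the entity mentions and their replacements are already known (no NER step is shown) and that answer spans do not cut through an entity mention.

```python
import re


def rename_entities(example, mapping):
    """Return a copy of a SQuAD-style example with entity mentions renamed.

    `mapping` maps original entity strings to replacement strings.
    Illustrative only: this is not the EntSwap method from the paper.
    """
    if not mapping:
        return dict(example)

    # Try longer names first so "New York City" is matched before "New York".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def swap(text):
        return pattern.sub(lambda m: mapping[m.group(1)], text)

    old_context = example["context"]
    new_context = swap(old_context)

    new_answers = []
    for ans in example["answers"]:
        # Renaming the text before the answer gives the shifted start offset.
        new_start = len(swap(old_context[: ans["answer_start"]]))
        new_text = swap(ans["text"])
        # Keep the answer only if it still lines up with the renamed context.
        if new_context[new_start : new_start + len(new_text)] == new_text:
            new_answers.append({"text": new_text, "answer_start": new_start})

    return {
        "context": new_context,
        "question": swap(example["question"]),
        "answers": new_answers,
    }


# Hypothetical example record and replacement names, for illustration only.
example = {
    "context": "John Smith was born in France. He later moved to Paris.",
    "question": "Where was John Smith born?",
    "answers": [{"text": "France", "answer_start": 23}],
}
mapping = {"John Smith": "Chinedu Okafor", "France": "Kenya", "Paris": "Nairobi"}
perturbed = rename_entities(example, mapping)
```

Applying the same mapping to the context, question, and answers keeps the example answerable, and recomputing `answer_start` keeps the span labels valid after names of different lengths are substituted.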

Related papers

Self-Improving LLM Agents at Test-Time [49.9396634315896]
One paradigm of language model (LM) fine-tuning relies on creating large training datasets. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive. We study two variants of this approach: Test-Time Self-Improvement (TT-SI) and Test-Time Distillation (TT-D).
arXiv Detail & Related papers (2025-10-09T06:37:35Z)
Conformalized Exceptional Model Mining: Telling Where Your Model Performs (Not) Well [31.013018198280506]
This paper introduces a novel framework, Conformalized Exceptional Model Mining. It combines the rigor of Conformal Prediction with the explanatory power of Exceptional Model Mining. We develop a new model class, mSMoPE, which quantifies uncertainty through conformal prediction's rigorous coverage guarantees.
arXiv Detail & Related papers (2025-08-21T13:43:14Z)
RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios [5.617202699068449]
We evaluate the robustness of several large language models on multiple datasets. Benchmark datasets are constructed by introducing naturally-preserving, non-malicious perturbations.
arXiv Detail & Related papers (2024-08-04T08:43:09Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS). We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA). We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging. Our results show that RoMQA is a challenging benchmark for large language models, and provides a quantifiable test to build more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z)
On the Robustness of Reading Comprehension Models to Entity Renaming [44.11484801074727]
We study the robustness of machine reading comprehension (MRC) models to entity renaming. We propose a general and scalable method to replace person names with names from a variety of sources. We find that MRC models consistently perform worse when entities are renamed.
arXiv Detail & Related papers (2021-10-16T11:46:32Z)
RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models [32.806292167848156]
We propose RockNER to audit the robustness of named entity recognition models. We replace target entities with other entities of the same semantic class in Wikidata. At the context level, we use pre-trained language models to generate word substitutions.
arXiv Detail & Related papers (2021-09-12T21:30:21Z)
Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
Towards Trustworthy Deception Detection: Benchmarking Model Robustness across Domains, Modalities, and Languages [10.131671217810581]
We evaluate model robustness to out-of-domain data, modality-specific features, and languages other than English. We find that with additional image content as input, ELMo embeddings yield significantly fewer errors compared to BERT or GloVe.
arXiv Detail & Related papers (2021-04-23T18:05:52Z)
Benchmarking Robustness of Machine Reading Comprehension Models [29.659586787812106]
We construct AdvRACE, a new model-agnostic benchmark for evaluating the robustness of MRC models under four different types of adversarial attacks. We show that state-of-the-art (SOTA) models are vulnerable to all of these attacks. We conclude that there is substantial room for building more robust MRC models and our benchmark can help motivate and measure progress in this area.
arXiv Detail & Related papers (2020-04-29T08:05:32Z)
Zero-Resource Cross-Domain Named Entity Recognition [68.83177074227598]
Existing models for cross-domain named entity recognition rely on numerous unlabeled corpora or labeled NER training data in target domains. We propose a cross-domain NER model that does not use any external resources.
arXiv Detail & Related papers (2020-02-14T09:04:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.