Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data
- URL: http://arxiv.org/abs/2010.03656v1
- Date: Wed, 7 Oct 2020 21:17:25 GMT
- Title: Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data
- Authors: Shachar Rosenman, Alon Jacovi, Yoav Goldberg
- Abstract summary: We identify failure modes of SOTA relation extraction (RE) models trained on TACRED.
By adding some of the challenge data as training examples, the performance of the model improves.
- Score: 49.378860065474875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The process of collecting and annotating training data may introduce
distribution artifacts which may limit the ability of models to learn correct
generalization behavior. We identify failure modes of SOTA relation extraction
(RE) models trained on TACRED, which we attribute to limitations in the data
annotation process. We collect and annotate a challenge-set we call Challenging
RE (CRE), based on naturally occurring corpus examples, to benchmark this
behavior. Our experiments with four state-of-the-art RE models show that they
have indeed adopted shallow heuristics that do not generalize to the
challenge-set data. Further, we find that alternative question answering
modeling performs significantly better than the SOTA models on the
challenge-set, despite worse overall TACRED performance. By adding some of the
challenge data as training examples, the performance of the model improves.
Finally, we provide concrete suggestions on how to improve RE data collection to
alleviate this behavior.
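The failure mode the abstract describes can be illustrated with a toy sketch (this is not the paper's code; all examples, labels, and type names below are invented): a "shallow heuristic" classifier that predicts a relation from the entity-type pair alone, ignoring sentence context, fits the training artifact perfectly but fails on challenge-style examples where the same entity types co-occur with no relation.

```python
# Illustrative sketch (not the paper's code): a classifier that memorizes
# the majority relation for each (subject type, object type) pair and
# never looks at the sentence -- the kind of heuristic the paper exposes.
from collections import Counter

def train_type_pair_baseline(examples):
    """Memorize the majority relation label for each (subj_type, obj_type) pair."""
    counts = {}
    for ex in examples:
        key = (ex["subj_type"], ex["obj_type"])
        counts.setdefault(key, Counter())[ex["relation"]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def predict(model, ex):
    return model.get((ex["subj_type"], ex["obj_type"]), "no_relation")

def accuracy(model, examples):
    return sum(predict(model, ex) == ex["relation"] for ex in examples) / len(examples)

# Invented training data: entity types correlate perfectly with the label.
train = [
    {"subj_type": "PERSON", "obj_type": "CITY", "relation": "per:city_of_birth"},
    {"subj_type": "ORG", "obj_type": "PERSON", "relation": "org:founded_by"},
] * 10

# Challenge-style data: same entity-type pairs, but the (unseen) sentence
# context supports no relation at all.
challenge = [
    {"subj_type": "PERSON", "obj_type": "CITY", "relation": "no_relation"},
    {"subj_type": "ORG", "obj_type": "PERSON", "relation": "no_relation"},
]

model = train_type_pair_baseline(train)
print(accuracy(model, train))      # 1.0 -- the heuristic fits the artifact
print(accuracy(model, challenge))  # 0.0 -- and fails when context matters
```

A real RE model is far richer than this lookup table, but the evaluation logic is the same: compare scores on the original test split against scores on challenge data that breaks the spurious correlation.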
Related papers
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Experts (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems [17.10762463903638]
We train evaluation models to approximate human evaluation, achieving high agreement.
We propose a weak-to-strong supervision method that uses a fraction of the annotated data to train an evaluation model.
arXiv Detail & Related papers (2024-06-26T10:48:14Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Improving QA Model Performance with Cartographic Inoculation [0.0]
"Dataset artifacts" reduce the model's ability to generalize to real-world QA problems.
We analyze the impacts and incidence of dataset artifacts using an adversarial challenge set.
We show that by selectively fine-tuning a model on ambiguous adversarial examples from a challenge set, significant performance improvements can be made.
arXiv Detail & Related papers (2024-01-30T23:08:26Z)
- Learning a model is paramount for sample efficiency in reinforcement learning control of PDEs [5.488334211013093]
We show that learning an actuated model in parallel to training the RL agent significantly reduces the total amount of required data sampled from the real system.
We also show that iteratively updating the model is of major importance to avoid biases in the RL training.
arXiv Detail & Related papers (2023-02-14T16:14:39Z)
- Regularizing Generative Adversarial Networks under Limited Data [88.57330330305535]
This work proposes a regularization approach for training robust GAN models on limited data.
We show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data.
arXiv Detail & Related papers (2021-04-07T17:59:06Z)
- One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
- Factual Error Correction for Abstractive Summarization Models [41.77317902748772]
We propose a post-editing corrector module to correct factual errors in generated summaries.
We show that our model is able to correct factual errors in summaries generated by other neural summarization models.
We also find that transferring from artificial error correction to downstream settings is still very challenging.
arXiv Detail & Related papers (2020-10-17T04:24:16Z)
- Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation [86.40610684026262]
In this work, we explore identifying the inactive training examples which contribute less to model performance.
We introduce data rejuvenation to improve the training of NMT models on large-scale datasets by exploiting inactive examples.
Experimental results on WMT14 English-German and English-French datasets show that the proposed data rejuvenation consistently and significantly improves performance for several strong NMT models.
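The data-rejuvenation recipe summarized above can be sketched in a few lines (a hedged illustration, not the authors' implementation; the scoring and generation models here are toy stand-ins): an identification model scores each training pair, low-scoring pairs are flagged as inactive, and their target side is regenerated before retraining.

```python
# Hedged sketch of data rejuvenation (invented interfaces): partition
# training pairs by an identification-model score, then regenerate the
# target side of the inactive pairs with a rejuvenation model.

def split_by_score(pairs, score_fn, threshold):
    """Partition (src, tgt) pairs into active and inactive by model score."""
    active, inactive = [], []
    for pair in pairs:
        (active if score_fn(*pair) >= threshold else inactive).append(pair)
    return active, inactive

def rejuvenate(inactive_pairs, generate_fn):
    """Replace the target side of inactive pairs with generated targets."""
    return [(src, generate_fn(src)) for src, _ in inactive_pairs]

# Toy stand-ins: the "probability" is a crude length-ratio score, and the
# rejuvenation model just uppercases the source string.
def toy_score(src, tgt):
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

toy_generate = str.upper

pairs = [("guten tag", "good day"), ("ja", "yes, certainly, of course")]
active, inactive = split_by_score(pairs, toy_score, threshold=0.5)
retrain_set = active + rejuvenate(inactive, toy_generate)
# The badly mismatched pair now carries a regenerated target for retraining.
```

In the actual NMT setting, both the identification and rejuvenation models would be trained translation models and the score would be a sentence-level probability; only the split-and-regenerate structure carries over.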
arXiv Detail & Related papers (2020-10-06T08:57:31Z)
- Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension [27.538957000237176]
Humans create questions adversarially, such that the model fails to answer them correctly.
We collect 36,000 samples with progressively stronger models in the annotation loop.
We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets.
We find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.
arXiv Detail & Related papers (2020-02-02T00:22:55Z)
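The model-in-the-loop annotation scheme in the last entry reduces to a simple acceptance rule, sketched below (invented interfaces and toy data, not the paper's pipeline): a candidate question is kept as adversarial only if the in-the-loop model fails to answer it.

```python
# Minimal sketch of adversarial human annotation with a model in the loop
# (invented interfaces): keep only candidates the current model gets wrong.

def collect_adversarial(candidates, model_fn):
    """Keep (question, answer) pairs the in-the-loop model answers incorrectly."""
    return [(q, a) for q, a in candidates if model_fn(q) != a]

# Toy in-the-loop model: always answers "Paris".
toy_model = lambda question: "Paris"

candidates = [
    ("Capital of France?", "Paris"),        # model answers correctly -> discarded
    ("Capital of Australia?", "Canberra"),  # model fails -> kept as adversarial
]
adversarial = collect_adversarial(candidates, toy_model)
```

Running progressively stronger models in place of `toy_model` yields the progressively harder annotation rounds the paper describes.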
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.