What's the best place for an AI conference, Vancouver or ______: Why
completing comparative questions is difficult
- URL: http://arxiv.org/abs/2104.01940v1
- Date: Mon, 5 Apr 2021 14:56:09 GMT
- Title: What's the best place for an AI conference, Vancouver or ______: Why
completing comparative questions is difficult
- Authors: Avishai Zagoury and Einat Minkov and Idan Szpektor and William W.
Cohen
- Abstract summary: We study the ability of neural LMs to ask (not answer) reasonable questions.
We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable.
- Score: 22.04829832439774
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although large neural language models (LMs) like BERT can be finetuned to
yield state-of-the-art results on many NLP tasks, it is often unclear what
these models actually learn. Here we study using such LMs to fill in entities
in human-authored comparative questions, like "Which country is older, India
or ______?" -- i.e., we study the ability of neural LMs to ask (not answer)
reasonable questions. We show that accuracy in this fill-in-the-blank task is
well-correlated with human judgements of whether a question is reasonable, and
that these models can be trained to achieve nearly human-level performance in
completing comparative questions in three different subdomains. However,
analysis shows that what they learn fails to model any sort of broad notion of
which entities are semantically comparable or similar -- instead the trained
models are very domain-specific, and performance is highly correlated with
co-occurrences between specific entities observed in the training set. This is
true both for models that are pretrained on general text corpora, as well as
models trained on a large corpus of comparison questions. Our study thus
reinforces recent results on the difficulty of making claims about a deep
model's world knowledge or linguistic competence based on performance on
specific benchmark problems. We make our evaluation datasets publicly available
to foster future research on complex understanding and reasoning in such models
at standards of human interaction.
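As a rough illustration of the fill-in-the-blank setup described above (a minimal sketch, not the authors' pipeline; the library, checkpoint name, and example question are assumptions), one can query a generic pretrained masked LM for candidate completions of a comparative question:

    # Sketch: probe a pretrained masked LM for completions of a comparative
    # question. Uses Hugging Face Transformers; bert-base-uncased is an
    # illustrative checkpoint, not necessarily the one used in the paper.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")

    # Put the model's mask token where the blank would be.
    question = f"Which country is older, India or {fill.tokenizer.mask_token}?"

    for candidate in fill(question, top_k=5):
        # Each candidate is a dict with the predicted token and its probability.
        print(f"{candidate['token_str']:>12}  {candidate['score']:.3f}")

Whether the top-ranked completions yield questions that humans judge reasonable is what the fill-in-the-blank accuracy discussed above is meant to capture.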
Related papers
- OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving? [2.851415653352522]
The Orion-1 model by OpenAI is claimed to have more robust logical reasoning capabilities than previous large language models.
We conduct a comparison experiment using two datasets, one of which consists of International Mathematics Olympiad (IMO) problems and is easily accessible.
We conclude that there is no significant evidence to show that the model relies on memorizing problems and solutions.
arXiv Detail & Related papers (2024-11-09T14:47:52Z) - Longer Fixations, More Computation: Gaze-Guided Recurrent Neural
Networks [12.57650361978445]
Humans read texts at a varying pace, while machine learning models treat each token in the same way.
In this paper, we convert this intuition into a set of novel models with fixation-guided parallel RNNs or layers.
We find that, interestingly, the fixation durations predicted by neural networks bear some resemblance to human fixations.
arXiv Detail & Related papers (2023-10-31T21:32:11Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a large set of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Can NLP Models Correctly Reason Over Contexts that Break the Common
Assumptions? [14.991565484636745]
We investigate the ability of NLP models to correctly reason over contexts that break the common assumptions.
We show that, while performing fairly well on contexts that follow the common assumptions, the models struggle to reason correctly over contexts that break those assumptions.
Specifically, the performance gap is as high as 20 absolute percentage points.
arXiv Detail & Related papers (2023-05-20T05:20:37Z) - A quantitative study of NLP approaches to question difficulty estimation [0.30458514384586394]
This work quantitatively analyzes several approaches proposed in previous research and compares their performance on datasets from different educational domains.
We find that Transformer based models are the best performing across different educational domains, with DistilBERT performing almost as well as BERT.
As for the other models, hybrid ones often outperform those based on a single type of feature; models based on linguistic features perform well on reading comprehension questions, while frequency-based features (TF-IDF) and word embeddings (word2vec) perform better in domain knowledge assessment.
arXiv Detail & Related papers (2023-05-17T14:26:00Z) - ALERT: Adapting Language Models to Reasoning Tasks [43.8679673685468]
ALERT is a benchmark and suite of analyses for assessing language models' reasoning ability.
ALERT provides a test bed to assess any language model on fine-grained reasoning skills.
We find that language models learn more reasoning skills during the finetuning stage than during pretraining.
arXiv Detail & Related papers (2022-12-16T05:15:41Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite being trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - On the Efficacy of Adversarial Data Collection for Question Answering:
Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z) - Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
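For readers unfamiliar with this operationalization, the underlying idea (written here in standard information-theoretic notation; the symbols are ours, not necessarily the paper's) is that a learned probe q estimates the mutual information between a linguistic property T and a representation R, with the probe's cross-entropy giving a lower bound:

    I(T; R) = H(T) - H(T | R) >= H(T) - H_q(T | R)

where H_q(T | R) is the cross-entropy of the probe q(t | r) on held-out data, so a better probe yields a tighter lower bound on how much information the representation carries about the property.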
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.