Do Question Answering Modeling Improvements Hold Across Benchmarks?
- URL: http://arxiv.org/abs/2102.01065v3
- Date: Tue, 30 May 2023 20:50:47 GMT
- Title: Do Question Answering Modeling Improvements Hold Across Benchmarks?
- Authors: Nelson F. Liu and Tony Lee and Robin Jia and Percy Liang
- Abstract summary: We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
- Score: 84.48867898593052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Do question answering (QA) modeling improvements (e.g., choice of
architecture and training procedure) hold consistently across the diverse
landscape of QA benchmarks? To study this question, we introduce the notion of
concurrence -- two benchmarks have high concurrence on a set of modeling
approaches if they rank the modeling approaches similarly. We measure the
concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches
and find that human-constructed benchmarks have high concurrence amongst
themselves, even if their passage and question distributions are very
different. Surprisingly, even downsampled human-constructed benchmarks (i.e.,
collecting less data) and programmatically-generated benchmarks (e.g.,
cloze-formatted examples) have high concurrence with human-constructed
benchmarks. These results indicate that, despite years of intense community
focus on a small number of benchmarks, the modeling improvements studied hold
broadly.
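The abstract does not pin down how concurrence is computed; as a minimal sketch, assuming it is measured as a rank correlation (Kendall's tau here, via SciPy) between two benchmarks' scores for the same set of modeling approaches, with scores invented for illustration:

```python
# Minimal sketch: concurrence as rank correlation between two benchmarks'
# scores for the same modeling approaches. The choice of Kendall's tau is
# an assumption; the paper may use a different correlation statistic.
from scipy.stats import kendalltau

# Invented exact-match scores for five modeling approaches on two benchmarks.
benchmark_a = [62.1, 70.4, 75.8, 81.2, 84.0]
benchmark_b = [55.3, 63.0, 69.9, 74.5, 78.1]

tau, p_value = kendalltau(benchmark_a, benchmark_b)
print(f"Concurrence (Kendall's tau) = {tau:.3f} (p = {p_value:.3f})")
# A tau near 1.0 means the two benchmarks rank the modeling approaches similarly.
```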
Related papers
- Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback [64.67540769692074]
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date.
We introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models.
Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench.
arXiv Detail & Related papers (2024-10-04T04:56:11Z)
- Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare [99.57567498494448]
We introduce Compare2Score, an all-around LMM-based no-reference IQA model.
During training, we generate scaled-up comparative instructions by comparing images from the same IQA dataset.
Experiments on nine IQA datasets validate that Compare2Score effectively bridges text-defined comparative levels during training.
arXiv Detail & Related papers (2024-05-29T17:26:09Z)
- Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks [2.1899189033259305]
The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance.
This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest.
We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
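To make finding (1) concrete, the hedged sketch below builds a models-by-prompts correctness matrix (values invented for illustration, not taken from the paper) and correlates the prompt columns across models:

```python
# Sketch only: measuring correlation of per-prompt performance across models.
# The 0/1 correctness matrix below is invented for illustration.
import numpy as np

# rows = models, columns = test prompts (1 = correct, 0 = incorrect)
correctness = np.array([
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
])

# Pairwise Pearson correlation between prompt columns, computed across models.
prompt_corr = np.corrcoef(correctness.T)
print(np.round(prompt_corr, 2))
# Strong off-diagonal values indicate prompts that succeed or fail together;
# under such correlations, a plain average over prompts can over- or
# under-weight clusters of similar prompts and shift model rankings.
```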
arXiv Detail & Related papers (2024-04-25T18:35:54Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
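A rough sketch of the anchor-point idea, under the assumption that anchors are chosen by clustering examples on the confidence existing models assign to the correct class; all data below is synthetic, and this is not necessarily the authors' exact procedure:

```python
# Sketch of anchor-point selection (not the authors' exact procedure):
# cluster examples by their vector of per-model confidences in the correct
# class, then keep the example nearest each centroid as an anchor point.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_examples, n_models, n_anchors = 500, 8, 10

# Invented confidences of n_models existing models in the correct class.
confidences = rng.beta(4, 2, size=(n_examples, n_models))

kmeans = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(confidences)
# Anchor = example closest to each cluster centroid.
anchors = [
    int(np.argmin(np.linalg.norm(confidences - c, axis=1)))
    for c in kmeans.cluster_centers_
]

# A new model's accuracy can then be approximated from its confidence on the
# anchors alone, weighting each anchor by the size of its cluster.
new_model_conf = rng.beta(4, 2, size=n_examples)  # invented stand-in
cluster_sizes = np.bincount(kmeans.labels_, minlength=n_anchors)
estimate = np.average(new_model_conf[anchors], weights=cluster_sizes)
print(f"Estimated accuracy from {n_anchors} anchors: {estimate:.3f}")
```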
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
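A hedged sketch of what end-to-end QAG inference can look like with a fine-tuned seq2seq LM; the checkpoint name, the prompt prefix, and the output format are placeholder assumptions rather than details taken from the paper:

```python
# Sketch of end-to-end question-answer generation (QAG) with a fine-tuned
# seq2seq LM. The checkpoint name is a placeholder assumption, not necessarily
# a model released by the paper's authors.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "your-org/t5-base-finetuned-qag"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = (
    "William Turner was an English painter who specialised in watercolour "
    "landscapes. He is often known as William Turner of Oxford."
)

# End-to-end QAG: the model maps a raw passage directly to a serialized list
# of question-answer pairs in a single generation step.
inputs = tokenizer("generate question and answer: " + context, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# e.g. "question: Who was William Turner? answer: an English painter | ..."
```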
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- A Theoretically Grounded Benchmark for Evaluating Machine Commonsense [6.725087407394836]
Theoretically-Grounded Commonsense Reasoning (TG-CSR) is based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense.
TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs.
Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks.
arXiv Detail & Related papers (2022-03-23T04:06:01Z)
- How not to Lie with a Benchmark: Rearranging NLP Leaderboards [0.0]
We examine popular NLP benchmarks' overall scoring methods and rearrange the models by geometric and harmonic mean.
We analyze several popular benchmarks including GLUE, SuperGLUE, XGLUE, and XTREME.
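A small illustration (with invented per-task scores) of why the choice of mean matters: geometric and harmonic means penalize uneven per-task performance more than the arithmetic mean and can therefore reorder a leaderboard.

```python
# Illustration with invented per-task scores: the arithmetic mean slightly
# favors the uneven model, while geometric and harmonic means reward
# balanced performance.
from statistics import fmean, geometric_mean, harmonic_mean

leaderboard = {
    "balanced_model": [80.0, 79.0, 81.0, 80.0],
    "uneven_model":   [98.0, 97.0, 96.0, 30.0],
}

for name, scores in leaderboard.items():
    print(
        f"{name}: arithmetic={fmean(scores):.1f} "
        f"geometric={geometric_mean(scores):.1f} "
        f"harmonic={harmonic_mean(scores):.1f}"
    )
# balanced_model: arithmetic=80.0 geometric=80.0 harmonic=80.0
# uneven_model:   arithmetic=80.2 geometric=72.3 harmonic=62.2
```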
arXiv Detail & Related papers (2021-12-02T15:40:52Z)
- DQI: A Guide to Benchmark Evaluation [22.54066527822898]
A model A may surpass humans on benchmark B yet fail on similar benchmarks C, D, and E.
We propose a novel approach to the underexplored task of quantifying benchmark quality by introducing a data quality metric, DQI.
arXiv Detail & Related papers (2020-08-10T08:38:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.