Summarization from Leaderboards to Practice: Choosing A Representation Backbone and Ensuring Robustness
- URL: http://arxiv.org/abs/2306.10555v1
- Date: Sun, 18 Jun 2023 13:35:41 GMT
- Title: Summarization from Leaderboards to Practice: Choosing A Representation Backbone and Ensuring Robustness
- Authors: David Demeter, Oshin Agarwal, Simon Ben Igeri, Marko Sterbentz, Neil Molino, John M. Conroy, Ani Nenkova
- Abstract summary: In both automatic and human evaluation, BART performs better than PEGASUS and T5.
We find considerable variation in system output that can be captured only with human evaluation.
- Score: 21.567112955050582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Academic literature does not give much guidance on how to build the best
possible customer-facing summarization system from existing research
components. Here we present analyses to inform the selection of a system
backbone from popular models; we find that in both automatic and human
evaluation, BART performs better than PEGASUS and T5. We also find that when
applied cross-domain, summarizers exhibit considerably worse performance. At
the same time, a system fine-tuned on heterogeneous domains performs well on
all domains and will be most suitable for a broad-domain summarizer. Our work
highlights the need for heterogeneous domain summarization benchmarks. We find
considerable variation in system output that can be captured only with human
evaluation and is thus unlikely to be reflected in standard leaderboards with
only automatic evaluation.
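To make the backbone comparison above concrete, here is a minimal sketch of scoring candidate summarizers with ROUGE via the Hugging Face `transformers` and `evaluate` libraries. The checkpoint names and the toy document/reference pair are illustrative assumptions, not the paper's fine-tuned models or evaluation data.

```python
# Minimal sketch: score candidate summarization backbones with ROUGE.
# Checkpoints and the toy example are illustrative, not the paper's
# actual fine-tuned models or evaluation setup.
from transformers import pipeline
import evaluate

document = (
    "The city council met on Tuesday to debate the new transit plan, "
    "which would add two bus lines and extend light rail to the airport."
)
reference = "The council debated a transit plan adding bus lines and airport rail."

rouge = evaluate.load("rouge")
checkpoints = {
    "BART": "facebook/bart-large-cnn",
    "PEGASUS": "google/pegasus-cnn_dailymail",
    "T5": "t5-base",
}

for name, ckpt in checkpoints.items():
    summarizer = pipeline("summarization", model=ckpt)
    prediction = summarizer(document, max_length=48, min_length=8)[0]["summary_text"]
    scores = rouge.compute(predictions=[prediction], references=[reference])
    print(f"{name}: ROUGE-1={scores['rouge1']:.3f} ROUGE-L={scores['rougeL']:.3f}")
```

Note that automatic ROUGE alone would miss the output variation the abstract highlights; the human evaluation the paper calls for would be collected separately.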
Related papers
- Unified Examination of Entity Linking in Absence of Candidate Sets [3.55026004901472]
We use an ablation study to investigate the impact of candidate sets on the performance of entity linking.
We show the trade-off between less restrictive candidate sets and increased inference time and memory footprint for some models.
arXiv Detail & Related papers (2024-04-17T04:37:58Z)
- Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments.
Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains.
We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z)
- To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering [46.403929561360485]
We study the end-to-end performance of open-domain question answering (ODQA) models.
We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy.
We propose and evaluate several intervention methods which improve end-to-end answer F1 score by up to 24 points.
arXiv Detail & Related papers (2022-12-20T16:06:09Z)
- Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant.
To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
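As a rough illustration of the kNN-style relevance feedback described above, the sketch below re-scores documents by interpolating query similarity with similarity to the centroid of user-marked relevant documents. The dense-vector representation, cosine similarity, and the `alpha` interpolation weight are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

def rerank_with_feedback(query_vec, doc_vecs, relevant_vecs, alpha=0.5):
    """Re-rank documents by mixing similarity to the query with similarity
    to documents the user marked relevant (kNN-style relevance feedback).
    alpha is an illustrative interpolation weight, not the paper's value."""
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    centroid = np.mean(relevant_vecs, axis=0)  # summarize user feedback
    scores = [alpha * cosine(query_vec, d) + (1 - alpha) * cosine(centroid, d)
              for d in doc_vecs]
    return np.argsort(scores)[::-1]  # indices, most relevant first

# Toy usage with random embeddings standing in for encoder output.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))
order = rerank_with_feedback(rng.normal(size=8), docs, docs[:2])
print(order)
```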
- Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields.
arXiv Detail & Related papers (2022-10-11T20:19:11Z)
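The blurb above does not specify which social-choice rules Vote'n'Rank implements; as one classic example of aggregating a multi-task leaderboard with social choice theory, here is a hypothetical Borda-count aggregation.

```python
# Hypothetical Borda-count aggregation of per-task leaderboard scores,
# one classic social-choice rule; not necessarily the rule Vote'n'Rank uses.
from collections import defaultdict

scores = {  # illustrative numbers only
    "task_a": {"sys1": 0.81, "sys2": 0.79, "sys3": 0.90},
    "task_b": {"sys1": 0.62, "sys2": 0.71, "sys3": 0.55},
    "task_c": {"sys1": 0.88, "sys2": 0.84, "sys3": 0.70},
}

borda = defaultdict(int)
for task_scores in scores.values():
    # Each task "votes" by ranking systems; rank i earns n-1-i points.
    ranking = sorted(task_scores, key=task_scores.get, reverse=True)
    for i, system in enumerate(ranking):
        borda[system] += len(ranking) - 1 - i

print(sorted(borda.items(), key=lambda kv: kv[1], reverse=True))
```

Unlike averaging raw scores, a rule like this is insensitive to each task's score scale, which is one motivation for bringing social choice theory to benchmarking.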
- Review-Based Domain Disentanglement without Duplicate Users or Contexts for Cross-Domain Recommendation [1.2074552857379273]
Cross-domain recommendation has shown promising results in solving data-sparsity and cold-start problems.
Our model (named SER) uses three text analysis modules, guided by a single domain discriminator for disentangled representation learning.
arXiv Detail & Related papers (2021-10-25T05:17:58Z)
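A single domain discriminator guiding disentangled representations is often implemented with adversarial training through a gradient-reversal layer; the PyTorch sketch below shows that generic pattern. The layer sizes and the use of gradient reversal are assumptions for illustration, not details taken from the SER paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, so the encoder is pushed toward domain-invariant features."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DomainDiscriminator(nn.Module):
    """Classifies which domain a representation came from; sizes are
    illustrative, not the SER paper's actual architecture."""
    def __init__(self, hidden_dim=256, num_domains=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, num_domains)
        )

    def forward(self, features):
        return self.classifier(GradReverse.apply(features))
```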
- Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
arXiv Detail & Related papers (2021-06-01T16:00:08Z)
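Since StyleMatch is described as building on FixMatch, the sketch below shows FixMatch's core mechanism, confidence-thresholded pseudo-labeling, in PyTorch. The 0.95 threshold and the classifier interface are illustrative; StyleMatch's style-based extensions are not reproduced here.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    """FixMatch-style unlabeled loss: pseudo-label confident predictions on
    weakly augmented inputs, then train the model to match those labels on
    strongly augmented versions of the same inputs."""
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = confidence >= threshold  # keep only confident pseudo-labels
    logits_strong = model(strong_batch)
    per_example = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_example * mask.float()).mean()
```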
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z)
- Learning Meta Face Recognition in Unseen Domains [74.69681594452125]
We propose a novel face recognition method via meta-learning, named Meta Face Recognition (MFR).
MFR synthesizes the source/target domain shift with a meta-optimization objective.
We propose two benchmarks for generalized face recognition evaluation.
arXiv Detail & Related papers (2020-03-17T14:10:30Z)
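The blurb describes a meta-optimization objective over synthesized source/target domain shifts. As a loose illustration of that idea, and not MFR's actual procedure, here is a generic MAML-style inner/outer step in PyTorch using `torch.func.functional_call`; the cross-entropy loss and single inner update are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def meta_step(model, source_batch, target_batch, inner_lr=0.01):
    """Generic MAML-style step: adapt on a simulated source domain, then
    measure loss on a simulated target domain with the adapted weights.
    A loose sketch of meta-optimization, not MFR's exact objective."""
    src_x, src_y = source_batch
    tgt_x, tgt_y = target_batch
    params = dict(model.named_parameters())

    # Inner step: one gradient update on the source domain.
    src_loss = F.cross_entropy(functional_call(model, params, (src_x,)), src_y)
    grads = torch.autograd.grad(src_loss, list(params.values()), create_graph=True)
    adapted = {name: p - inner_lr * g
               for (name, p), g in zip(params.items(), grads)}

    # Outer (meta) loss: target-domain performance of the adapted model,
    # backpropagated through the inner update to the original weights.
    return F.cross_entropy(functional_call(model, adapted, (tgt_x,)), tgt_y)
```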
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.