NUBIA: NeUral Based Interchangeability Assessor for Text Generation
- URL: http://arxiv.org/abs/2004.14667v2
- Date: Fri, 1 May 2020 09:58:56 GMT
- Title: NUBIA: NeUral Based Interchangeability Assessor for Text Generation
- Authors: Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh,
Mohamed Coulibali
- Abstract summary: We present NUBIA, a methodology to build automatic evaluation metrics for text generation using only machine learning models as core components.
A typical NUBIA model is composed of three modules: a neural feature extractor, an aggregator and a calibrator.
We demonstrate an implementation of NUBIA which outperforms metrics currently used to evaluate machine translation and summarization, and slightly exceeds or matches state-of-the-art metrics on correlation with human judgement.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present NUBIA, a methodology to build automatic evaluation metrics for
text generation using only machine learning models as core components. A
typical NUBIA model is composed of three modules: a neural feature extractor,
an aggregator and a calibrator. We demonstrate an implementation of NUBIA which
outperforms metrics currently used to evaluate machine translation and
summarization, and slightly exceeds or matches state-of-the-art metrics on
correlation with human judgement on the WMT segment-level Direct Assessment task, sentence-level
ranking and image captioning evaluation. The model implemented is modular,
explainable and set to continuously improve over time.
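The three-module design above maps naturally onto a small pipeline. The following is a minimal sketch of that structure, not the paper's implementation: the feature scorers are hypothetical placeholders (the paper derives features such as semantic similarity, logical inference and grammaticality from pretrained neural models), and a small scikit-learn regressor stands in for the learned aggregator.
```python
# Minimal sketch of NUBIA's three-module structure; the feature scorers are
# hypothetical placeholders, not the paper's pretrained neural extractors.
import numpy as np
from sklearn.neural_network import MLPRegressor

def extract_features(reference: str, candidate: str) -> np.ndarray:
    """Module 1: feature extractor (placeholders for neural model scores)."""
    semantic_similarity = len(set(reference.split()) & set(candidate.split()))
    logical_entailment = 0.0   # would come from an NLI model
    grammaticality = 0.0       # would come from LM perplexity
    return np.array([semantic_similarity, logical_entailment, grammaticality],
                    dtype=float)

# Module 2: aggregator -- a regressor trained to predict human quality
# ratings from the extracted features.
aggregator = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                          random_state=0)

def fit_aggregator(pairs, human_scores):
    """Fit the aggregator on (reference, candidate) pairs with human ratings."""
    X = np.stack([extract_features(ref, cand) for ref, cand in pairs])
    aggregator.fit(X, np.asarray(human_scores))

def nubia_score(reference: str, candidate: str) -> float:
    raw = aggregator.predict(extract_features(reference, candidate)[None, :])[0]
    # Module 3: calibrator -- bound the raw regression output to [0, 1].
    return float(np.clip(raw, 0.0, 1.0))
```
Training the aggregator against human quality ratings and then calibrating its output is what keeps the final score bounded and comparable across examples.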
Related papers
- BENCHAGENTS: Automated Benchmark Creation with Agent Interaction
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
- Learning to Extract Structured Entities Using Language Models
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP (AESOP) metric; a simplified overlap computation is sketched after this entry.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
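An entity-set overlap metric compares a predicted set of structured entities against a gold set. The sketch below is a simplified illustration of that idea, assuming entities are plain dicts of properties and using a greedy best-match pairing; it is not the paper's exact AESOP formulation.
```python
# Simplified entity-set overlap: greedily pair each predicted entity with its
# most similar remaining gold entity and average the pair similarities.
# Illustration only; not the paper's exact AESOP metric.
def property_similarity(pred: dict, gold: dict) -> float:
    """Fraction of gold (key, value) pairs reproduced by the prediction."""
    if not gold:
        return 0.0
    return sum(1 for k, v in gold.items() if pred.get(k) == v) / len(gold)

def entity_set_overlap(predicted: list, gold: list) -> float:
    if not predicted or not gold:
        return 0.0
    remaining, total = list(gold), 0.0
    for entity in predicted:
        best = max(remaining, key=lambda g: property_similarity(entity, g))
        total += property_similarity(entity, best)
        remaining.remove(best)
        if not remaining:
            break
    # Normalizing by the larger set penalizes both spurious and missing entities.
    return total / max(len(predicted), len(gold))
```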
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model; a generic pseudo-labeling loop in this spirit is sketched below.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
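The self-learning recipe above, where the model being trained annotates unlabeled text and the annotations are folded back into training, can be sketched as a generic pseudo-labeling loop. The `parser.annotate` and `parser.train` interface below is an assumed stand-in, and a raw confidence threshold replaces LOCCO's offline cycle-consistency scoring.
```python
# Generic pseudo-labeling loop in the spirit of the self-learning recipe
# above. The parser interface (annotate/train) is an assumed stand-in, and a
# raw confidence threshold replaces LOCCO's cycle-consistency scoring.
def self_training_round(parser, labeled, unlabeled, threshold=0.9):
    pseudo_labeled = []
    for text in unlabeled:
        annotation, confidence = parser.annotate(text)  # assumed interface
        if confidence >= threshold:
            pseudo_labeled.append((text, annotation))
    parser.train(labeled + pseudo_labeled)
    # The (annotation -> text) pairs could equally be reused to train a
    # neural text generator in the reverse direction.
    return pseudo_labeled
```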
- SMART: Sentences as Basic Units for Text Evaluation
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching metrics.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics; a toy version of this sentence-level matching is sketched below.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
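A toy version of sentence-level soft matching: each candidate sentence is matched to its most similar reference sentence and vice versa, and the resulting soft precision and recall are combined into an F-score. Jaccard token overlap stands in here for the model-based sentence matchers the paper actually uses.
```python
# Toy sentence-as-unit matching: Jaccard token overlap stands in for the
# learned sentence-matching functions used by the actual metric.
def sentence_sim(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta or tb else 0.0

def smart_like_score(candidate_sents, reference_sents) -> float:
    if not candidate_sents or not reference_sents:
        return 0.0
    precision = sum(max(sentence_sim(c, r) for r in reference_sents)
                    for c in candidate_sents) / len(candidate_sents)
    recall = sum(max(sentence_sim(r, c) for c in candidate_sents)
                 for r in reference_sents) / len(reference_sents)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```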
- CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation
We propose an unsupervised reference-free metric called CTRLEval to evaluate controlled text generation models.
CTRLEval assembles the generation probabilities from a pre-trained language model without any model training.
Experimental results show that our metric has higher correlations with human judgments than other baselines; a stripped-down log-likelihood scorer is sketched below.
arXiv Detail & Related papers (2022-04-02T13:42:49Z)
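The core training-free ingredient of such a metric is scoring text by its generation probability under a pretrained language model. The sketch below shows that ingredient only, using GPT-2 via Hugging Face transformers as an arbitrary stand-in; the actual metric assembles pattern-based probabilities rather than a single average log-likelihood.
```python
# Reference-free, training-free scoring with a pretrained LM via Hugging Face
# transformers; GPT-2 is an arbitrary stand-in model here.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_log_likelihood(text: str) -> float:
    """Average per-token log-probability of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels=ids the model returns the mean negative log-likelihood of
    # each token given its prefix; negate it for an average log-probability.
    return -model(ids, labels=ids).loss.item()
```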
- Evaluation of HTR models without Ground Truth Material
The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward.
But the evaluation process becomes tricky as soon as we switch from development to application.
We show that lexicon-based evaluation can compete with evaluation based on ground truth material; an illustrative lexicon hit-rate heuristic is sketched below.
arXiv Detail & Related papers (2022-01-17T01:26:09Z)
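One simple ground-truth-free proxy in this lexicon-based spirit is the share of recognized tokens that appear in a reference lexicon; competing HTR models can then be ranked on application data without any transcription. The function below is an illustrative heuristic, not the paper's exact procedure.
```python
# Illustrative ground-truth-free quality proxy: the share of recognized
# word tokens found in a reference lexicon. Heuristic only.
import re

def lexicon_hit_rate(recognized_text: str, lexicon: set) -> float:
    tokens = re.findall(r"[^\W\d_]+", recognized_text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)
```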
- Semantic-based Self-Critical Training For Question Generation
We present a fully Transformer-based reinforcement learning generator-evaluator architecture for neural question generation.
We propose a semantic-based self-critical training layout in the generator-evaluator architecture, which goes beyond typical maximum likelihood training; the core update is sketched below.
arXiv Detail & Related papers (2021-08-26T20:33:35Z)
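In self-critical training, the reward of a sampled output is baselined by the reward of the greedy decode, and the sampled sequence's log-probability is scaled by that advantage. A minimal sketch of the update, where the reward values are assumed to come from the semantic evaluator:
```python
# Minimal self-critical policy-gradient loss: the sampled sequence's
# log-probability is scaled by its reward advantage over the greedy decode.
# r_sample and r_greedy are assumed to come from the semantic evaluator.
import torch

def self_critical_loss(sample_logprob: torch.Tensor,
                       r_sample: float, r_greedy: float) -> torch.Tensor:
    advantage = r_sample - r_greedy
    # Minimizing -advantage * logprob raises the probability of samples
    # that beat the greedy baseline and lowers it otherwise.
    return -advantage * sample_logprob
```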
- Few-Shot Named Entity Recognition: A Comprehensive Study
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- SummEval: Re-evaluating Summarization Evaluation
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset; a minimal correlation computation in this spirit is sketched below.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
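Re-evaluations of this kind ultimately come down to how well each metric's scores correlate with human judgments over the same outputs. A minimal sketch using Kendall's tau rank correlation via SciPy:
```python
# Rank correlation between a metric's scores and human judgments over the
# same outputs, the core quantity in this kind of metric re-evaluation.
from scipy.stats import kendalltau

def metric_human_correlation(metric_scores, human_scores) -> float:
    tau, _ = kendalltau(metric_scores, human_scores)
    return tau
```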
- Learning by Semantic Similarity Makes Abstractive Summarization Better
We compare the generated summaries from a recent LM, BART, and the reference summaries from a benchmark dataset, CNN/DM.
Interestingly, model-generated summaries receive higher scores relative to reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)