BLEURT: Learning Robust Metrics for Text Generation
- URL: http://arxiv.org/abs/2004.04696v5
- Date: Thu, 21 May 2020 16:53:47 GMT
- Title: BLEURT: Learning Robust Metrics for Text Generation
- Authors: Thibault Sellam, Dipanjan Das, Ankur P. Parikh
- Abstract summary: We propose BLEURT, a learned evaluation metric based on BERT.
A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize.
BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset.
- Score: 17.40369189981227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text generation has made significant advances in the last few years. Yet,
evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU
and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a
learned evaluation metric based on BERT that can model human judgments with a
few thousand possibly biased training examples. A key aspect of our approach is
a novel pre-training scheme that uses millions of synthetic examples to help
the model generalize. BLEURT provides state-of-the-art results on the last
three years of the WMT Metrics shared task and the WebNLG Competition dataset.
In contrast to a vanilla BERT-based approach, it yields superior results even
when the training data is scarce and out-of-distribution.
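Metrics like BLEURT are typically judged by how well their segment-level scores correlate with human ratings. The following is a minimal, self-contained sketch of that evaluation protocol; the metric scores and human ratings below are made up for illustration, not taken from the paper.

```python
# Sketch of metric meta-evaluation: correlate a metric's segment-level
# scores with human judgments. All data here is hypothetical.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical metric scores and human ratings for five candidate sentences.
metric_scores = [0.91, 0.42, 0.77, 0.15, 0.60]
human_ratings = [0.88, 0.50, 0.70, 0.10, 0.65]

print(round(pearson(metric_scores, human_ratings), 3))
```

A high correlation (here close to 1.0) is what distinguishes a learned metric like BLEURT from n-gram overlap metrics such as BLEU or ROUGE, which often correlate poorly with human judgments.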
Related papers
- Selecting Between BERT and GPT for Text Classification in Political Science Research [4.487884986288122]
We evaluate the effectiveness of BERT-based versus GPT-based models in low-data scenarios.
We conclude by comparing these approaches in terms of performance, ease of use, and cost.
arXiv Detail & Related papers (2024-11-07T07:29:39Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest noncontiguous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- RoBLEURT Submission for the WMT2021 Metrics Task [72.26898579202076]
We present our submission to the WMT 2021 Metrics Shared Task: RoBLEURT.
Our model reaches state-of-the-art correlations with the WMT 2020 human annotations on 8 out of 10 to-English language pairs.
arXiv Detail & Related papers (2022-04-28T08:49:40Z)
- Towards Zero-Label Language Learning [20.28186484098947]
This paper explores zero-label learning in Natural Language Processing (NLP).
No human-annotated data is used anywhere during training, and models are trained purely on synthetic data.
Inspired by the recent success of few-shot inference on GPT-3, we present a training data creation procedure named Unsupervised Data Generation.
arXiv Detail & Related papers (2021-09-19T19:00:07Z)
- BERT based sentiment analysis: A software engineering perspective [0.9176056742068814]
The paper presents three different strategies to analyse BERT-based models for sentiment analysis.
The experimental results show that the BERT-based ensemble approach and the compressed BERT model attain improvements of 6-12% over prevailing tools on the F1 measure across all three datasets.
arXiv Detail & Related papers (2021-06-04T16:28:26Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- GPT-too: A language-model-first approach for AMR-to-text generation [22.65728041544785]
We propose an approach that combines a strong pre-trained language model with cycle consistency-based re-scoring.
Despite the simplicity of the approach, our experimental results show these models outperform all previous techniques.
arXiv Detail & Related papers (2020-05-18T22:50:26Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare the summaries generated by a recent language model, BART, with the reference summaries from a benchmark dataset, CNN/DM.
Interestingly, model-generated summaries receive higher scores relative to reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.