Abstract: We introduce GEM, a living benchmark for natural language Generation (NLG),
its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly
evolving ecosystem of automated metrics, datasets, and human evaluation
standards. However, because of this moving target, new models are often still
evaluated on divergent, Anglo-centric corpora with well-established, but flawed,
metrics.
This disconnect makes it challenging to identify the limitations of current
models and opportunities for progress. Addressing this limitation, GEM provides
an environment in which models can easily be applied to a wide set of corpora
and in which evaluation strategies can be tested. Regular updates to the benchmark will
help NLG research become more multilingual and evolve the challenge alongside models.
This paper serves as the description of the initial release for which we are
organizing a shared task at our ACL 2021 Workshop and in which we invite the
entire NLG community to participate.