Abstract: Automatic evaluation metrics for natural language generation (NLG)
conventionally rely on token-level or embedding-level comparisons with text
references.
This differs from human language processing, in which visual imagination
often improves comprehension. In this work, we propose ImaginE, an
imagination-based automatic evaluation metric for natural language generation.
With the help of CLIP and DALL-E, two cross-modal models pre-trained on
large-scale image-text pairs, we automatically generate an image as the
embodied imagination for the text snippet and compute the imagination
similarity using contextual embeddings. Experiments spanning several text
generation tasks demonstrate that adding imagination with ImaginE shows
great potential for introducing multi-modal information into NLG evaluation and
improves existing automatic metrics' correlations with human similarity
judgments in many cases.
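
As a rough illustration of the similarity step described above, the sketch below compares two image embeddings by cosine similarity. It assumes the embeddings have already been computed (for example, CLIP features of images generated from the reference and candidate texts). The function names `cosine_similarity` and `imagination_similarity` are hypothetical, and the metric's actual formulation may differ.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: treat an all-zero embedding as having no similarity.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def imagination_similarity(
    ref_image_emb: Sequence[float],
    cand_image_emb: Sequence[float],
) -> float:
    """Hypothetical helper: similarity of two 'imagined' images.

    Both arguments are assumed to be precomputed embeddings of the
    images generated for the reference text and the candidate text.
    """
    return cosine_similarity(ref_image_emb, cand_image_emb)
```

Calling `imagination_similarity(ref_emb, cand_emb)` yields a score in [-1, 1], with higher values meaning the two imagined images are closer in embedding space.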