Abstract: This paper reviews and summarizes human evaluation practices described in 97
style transfer papers with respect to three main evaluation aspects: style
transfer, meaning preservation, and fluency. In principle, evaluations by human
raters should be the most reliable. In practice, however, we find that the
protocols for human evaluation in these papers are often underspecified and not
standardized, which hampers the reproducibility of research in this field and
progress toward better human and automatic evaluation methods.