Abstract: The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for
benchmarking context-to-response abilities of task-oriented dialogue systems.
In this work, we identify inconsistencies in data preprocessing and reporting
of three corpus-based metrics used on this dataset, i.e., the BLEU score and the
Inform & Success rates. We point out several problems of the MultiWOZ benchmark,
such as unsatisfactory preprocessing, insufficient or under-specified evaluation
metrics, and a rigid database. We re-evaluate 7 end-to-end and 6 policy
optimization models in as-fair-as-possible setups, and we show that their
reported scores cannot be directly compared. To facilitate comparison of future
systems, we release our stand-alone standardized evaluation scripts. We also
give basic recommendations for corpus-based benchmarking in future work.