Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation
- URL: http://arxiv.org/abs/2201.13405v1
- Date: Mon, 31 Jan 2022 18:11:21 GMT
- Title: Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation
- Authors: Olga Majewska, Evgeniia Razumovskaia, Edoardo Maria Ponti, Ivan
Vulić, Anna Korhonen
- Abstract summary: The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
- Score: 70.81596088969378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual task-oriented dialogue (ToD) facilitates access to services and
information for many (communities of) speakers. Nevertheless, the potential of
this technology is not fully realised, as current datasets for multilingual ToD
(both for modular and end-to-end modelling) suffer from severe limitations.
1) When created from scratch, they are usually small in scale and fail to cover
many possible dialogue flows. 2) Translation-based ToD datasets might lack
naturalness and cultural specificity in the target language. In this work, to
tackle these limitations we propose a novel outline-based annotation process
for multilingual ToD datasets, where domain-specific abstract schemata of
dialogue are mapped into natural language outlines. These in turn guide the
target language annotators in writing a dialogue by providing instructions
about each turn's intents and slots. Through this process we annotate a new
large-scale dataset for training and evaluation of multilingual and
cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset
(termed COD) enables natural language understanding, dialogue state tracking,
and end-to-end dialogue modelling and evaluation in 4 diverse languages:
Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative
analyses of COD versus an equivalent translation-based dataset demonstrate
improvements in data quality, unlocked by the outline-based approach. Finally,
we benchmark a series of state-of-the-art systems for cross-lingual ToD,
setting reference scores for future work and demonstrating that COD prevents
over-inflated performance, typically met with prior translation-based ToD
datasets.
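The mapping from an abstract dialogue schema to a natural language outline, as described in the abstract, can be illustrated with a minimal sketch. This is not the authors' tooling: the `Turn` class, the `TEMPLATES` table, the intent labels, and the sample slots and values are all hypothetical and chosen only to show the idea of turning each turn's intent and slots into an instruction for an annotator.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Turn:
    """One abstract dialogue turn: who speaks, the intent, and slot values."""
    speaker: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


# Hypothetical instruction templates, one per intent label.
TEMPLATES = {
    "INFORM": "{speaker}: state that {details}.",
    "REQUEST": "{speaker}: ask about {details}.",
}


def describe(slots: Dict[str, str]) -> str:
    """Render slots as readable text; an empty value means 'slot name only'."""
    parts = []
    for name, value in slots.items():
        label = name.replace("_", " ")
        parts.append(f"{label} = {value}" if value else label)
    return ", ".join(parts)


def to_outline(turns: List[Turn]) -> List[str]:
    """Map each schema turn to a one-line natural language instruction."""
    lines = []
    for turn in turns:
        # Fall back to a generic instruction for intents without a template.
        template = TEMPLATES.get(turn.intent, "{speaker}: express {details}.")
        lines.append(
            template.format(speaker=turn.speaker, details=describe(turn.slots))
        )
    return lines


# Hypothetical two-turn schema for a restaurant-booking dialogue.
schema = [
    Turn("USER", "INFORM", {"cuisine": "Italian", "city": "Nairobi"}),
    Turn("SYSTEM", "REQUEST", {"party_size": ""}),
]
outline = to_outline(schema)
for line in outline:
    print(line)
```

In this sketch the outline stays language-neutral in content, so an annotator writing in the target language is free to phrase each turn naturally rather than translate a fixed source sentence.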
Related papers
- LaDA: Latent Dialogue Action For Zero-shot Cross-lingual Neural Network
Language Modeling [20.002861239367704]
Cross-lingual adaptation has proven effective in spoken language understanding systems with limited resources.
Existing methods are frequently unsatisfactory for intent detection and slot filling.
Latent Dialogue Action layer is proposed to optimize decoding strategy.
arXiv Detail & Related papers (2023-08-05T15:51:45Z)
- Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for
Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z)
- Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining
for Task-Oriented Dialog [67.20796950016735]
Multi2WOZ dataset spans four typologically diverse languages: Chinese, German, Arabic, and Russian.
We introduce a new framework for multilingual conversational specialization of pretrained language models (PrLMs) that aims to facilitate cross-lingual transfer for arbitrary downstream TOD tasks.
Our experiments show that, in most setups, the best performance entails the combination of (i) conversational specialization in the target language and (ii) few-shot transfer for the concrete TOD task.
arXiv Detail & Related papers (2022-05-20T18:35:38Z)
- GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented
Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z)
- BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue
Modeling [52.99188200886738]
BiToD is the first bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling.
BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base.
arXiv Detail & Related papers (2021-06-05T03:38:42Z)
- Crossing the Conversational Chasm: A Primer on Multilingual
Task-Oriented Dialogue Systems [51.328224222640614]
Current state-of-the-art ToD models based on large pretrained neural language models are data hungry.
Data acquisition for ToD use cases is expensive and tedious.
arXiv Detail & Related papers (2021-04-17T15:19:56Z)
- An Empirical Study of Cross-Lingual Transferability in Generative
Dialogue State Tracker [33.2309643963072]
We study the transferability of a cross-lingual generative dialogue state tracking system using a multilingual pre-trained seq2seq model.
We also find that our approaches have low cross-lingual transferability, and we provide investigation and discussion.
arXiv Detail & Related papers (2021-01-27T12:45:55Z)