MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for
Natural Language Understanding in Task-Oriented Dialogue
- URL: http://arxiv.org/abs/2212.10455v2
- Date: Mon, 19 Jun 2023 04:09:37 GMT
- Title: MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for
Natural Language Understanding in Task-Oriented Dialogue
- Authors: Nikita Moghe, Evgeniia Razumovskaia, Liane Guillou, Ivan Vuli\'c, Anna
Korhonen, Alexandra Birch
- Abstract summary: We extend the English only NLU++ dataset to include manual translations into a range of high, medium, and low resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
- Score: 115.32009638844059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Task-oriented dialogue (TOD) systems have been widely deployed in many
industries as they deliver more efficient customer support. These systems are
typically constructed for a single domain or language and do not generalise
well beyond this. To support work on Natural Language Understanding (NLU) in
TOD across multiple languages and domains simultaneously, we constructed
MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset. MULTI3NLU++
extends the English only NLU++ dataset to include manual translations into a
range of high, medium, and low resource languages (Spanish, Marathi, Turkish
and Amharic), in two domains (BANKING and HOTELS). Because of its multi-intent
property, MULTI3NLU++ represents complex and natural user goals, and therefore
allows us to measure the realistic performance of TOD systems in a varied set
of the world's languages. We use MULTI3NLU++ to benchmark state-of-the-art
multilingual models for the NLU tasks of intent detection and slot labelling
for TOD systems in the multilingual setting. The results demonstrate the
challenging nature of the dataset, particularly in the low-resource language
setting, offering ample room for future experimentation in multi-domain
multilingual TOD setups.
Related papers
- Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for
Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z) - Large Scale Multi-Lingual Multi-Modal Summarization Dataset [26.92121230628835]
We present the current largest multi-lingual multi-modal summarization dataset (M3LS)
It consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair.
It is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages.
arXiv Detail & Related papers (2023-02-13T18:00:23Z) - LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine
Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent advances still struggle to train a separate model for each language pair, which is costly and unaffordable when the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z) - Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining
for Task-Oriented Dialog [67.20796950016735]
Multi2WOZ dataset spans four typologically diverse languages: Chinese, German, Arabic, and Russian.
We introduce a new framework for multilingual conversational specialization of pretrained language models (PrLMs) that aims to facilitate cross-lingual transfer for arbitrary downstream TOD tasks.
Our experiments show that, in most setups, the best performance entails the combination of (I) conversational specialization in the target language and (ii) few-shot transfer for the concrete TOD task.
arXiv Detail & Related papers (2022-05-20T18:35:38Z) - GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented
Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z) - BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue
Modeling [52.99188200886738]
BiToD is the first bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling.
BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base.
arXiv Detail & Related papers (2021-06-05T03:38:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.