X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and
Few-shot Agents
- URL: http://arxiv.org/abs/2306.17674v1
- Date: Fri, 30 Jun 2023 14:03:30 GMT
- Title: X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and
Few-shot Agents
- Authors: Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury,
Gaël de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam
Kumaraguru, Nasredine Semmar, Sina J. Semnani, Jiwon Seo, Vivek Seshadri,
Manish Shrivastava, Michael Sun, Aditya Yadavalli, Chaobin You, Deyi Xiong
and Monica S. Lam
- Abstract summary: We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages.
X-RiSAWOZ has more than 18,000 human-verified dialogue utterances for each language.
We develop a toolset to accelerate the post-editing of a new language dataset after translation.
- Score: 43.446606562545085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Task-oriented dialogue research has mainly focused on a few popular languages
like English and Chinese, due to the high dataset creation cost for a new
language. To reduce the cost, we apply manual editing to automatically
translated data. We create a new multilingual benchmark, X-RiSAWOZ, by
translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean;
and a code-mixed English-Hindi language. X-RiSAWOZ has more than 18,000
human-verified dialogue utterances for each language, and unlike most
multilingual prior work, is an end-to-end dataset for building
fully-functioning agents.
The many difficulties we encountered in creating X-RiSAWOZ led us to develop
a toolset to accelerate the post-editing of a new language dataset after
translation. This toolset improves machine translation with a hybrid entity
alignment technique that combines neural with dictionary-based methods, along
with many automated and semi-automated validation checks.
We establish strong baselines for X-RiSAWOZ by training dialogue agents in
the zero- and few-shot settings where limited gold data is available in the
target language. Our results suggest that our translation and post-editing
methodology and toolset can be used to create new high-quality multilingual
dialogue agents cost-effectively. Our dataset, code, and toolkit are released
open-source.
Related papers
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual ALT system with available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues [7.8378818005171125]
Given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation.
We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values.
We show that the succinct representation reduces the compounding effect of translation errors.
arXiv Detail & Related papers (2021-11-04T01:08:14Z)
- GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems [66.92182084456809]
We introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset from an English ToD dataset.
Our method is based on translating dialogue templates and filling them with local entities in the target-language countries.
We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
arXiv Detail & Related papers (2021-10-14T19:33:04Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We also propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)