Multilingual Coreference Resolution in Multiparty Dialogue
- URL: http://arxiv.org/abs/2208.01307v2
- Date: Sun, 9 Jul 2023 02:06:43 GMT
- Title: Multilingual Coreference Resolution in Multiparty Dialogue
- Authors: Boyuan Zheng, Patrick Xia, Mahsa Yarmohammadi, Benjamin Van Durme
- Abstract summary: We create a large-scale dataset, Multilingual Multiparty Coref, for this task based on TV transcripts.
Due to the availability of gold-quality subtitles in multiple languages, we propose reusing the annotations to create silver coreference resolution data in other languages.
We find success both in using it for data augmentation and in training from scratch, which effectively simulates the zero-shot cross-lingual setting.
- Score: 29.92954906275944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multiparty dialogue datasets for entity coreference resolution are
nascent, and many challenges are still unaddressed. We create a large-scale
dataset, Multilingual Multiparty Coref (MMC), for this task based on TV
transcripts. Due to the availability of gold-quality subtitles in multiple
languages, we propose reusing the annotations to create silver coreference
resolution data in other languages (Chinese and Farsi) via annotation
projection. On the gold (English) data, off-the-shelf models perform relatively
poorly on MMC, suggesting that MMC has broader coverage of multiparty
coreference than prior datasets. On the silver data, we find success both in
using it for data augmentation and in training from scratch, which effectively
simulates the zero-shot cross-lingual setting.
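The annotation projection step described in the abstract (mapping gold English mention spans onto aligned subtitle tokens in the target language) can be sketched as follows. This is a minimal sketch assuming word alignments are already available from an off-the-shelf aligner; the function names and data layout are illustrative assumptions, not the MMC pipeline itself.

```python
# Minimal sketch of coreference annotation projection, assuming word
# alignments are given as (source_idx, target_idx) token pairs.
# Function names and data layout are illustrative, not from MMC.

def project_mention(mention, alignment):
    """Map an English mention span (start, end), inclusive token
    indices, onto the target-language sentence via word alignment."""
    targets = sorted(t for s, t in alignment if mention[0] <= s <= mention[1])
    if not targets:
        return None  # unaligned mention: dropped from the silver data
    # Use the contiguous span covering all aligned target tokens.
    return (targets[0], targets[-1])

def project_clusters(clusters, alignment):
    """Project each coreference cluster; keep only clusters that
    still have at least two aligned mentions."""
    silver = []
    for cluster in clusters:
        projected = [span for span in
                     (project_mention(m, alignment) for m in cluster)
                     if span is not None]
        if len(projected) >= 2:
            silver.append(projected)
    return silver

# Example: two clusters over a 5-token English sentence.
alignment = [(0, 1), (1, 0), (2, 2), (4, 3)]
clusters = [[(0, 0), (4, 4)], [(3, 3)]]
print(project_clusters(clusters, alignment))  # [[(1, 1), (3, 3)]]
```

Dropping unaligned mentions and the clusters they leave behind is one common design choice for keeping silver-data precision high; the paper may handle these cases differently.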
Related papers
- Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model [14.39119862985503]
We aim to create a multilingual automatic lyrics transcription (ALT) system using available datasets.
Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario.
We evaluate the performance of the multilingual model in comparison to its monolingual counterparts.
arXiv Detail & Related papers (2024-06-25T15:02:32Z)
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z)
- Towards Multilingual Automatic Dialogue Evaluation [9.264022699972621]
The main limiting factor in the development of robust multilingual dialogue evaluation metrics is the lack of multilingual data.
We propose a workaround for this lack of data by leveraging a strong multilingual pretrained LLM and augmenting existing English dialogue data using Machine Translation.
We empirically show that the naive approach of finetuning a pretrained multilingual encoder model with translated data is insufficient to outperform the strong baseline of finetuning a multilingual model with only source data.
arXiv Detail & Related papers (2023-08-31T15:15:26Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z)
- Soft Prompt Decoding for Multilingual Dense Retrieval [30.766917713997355]
We show that applying state-of-the-art approaches developed for cross-lingual information retrieval to MLIR tasks leads to sub-optimal performance.
This is due to the heterogeneous and imbalanced nature of multilingual collections.
We present KD-SPD, a novel soft prompt decoding approach for MLIR that implicitly "translates" the representation of documents in different languages into the same embedding space.
arXiv Detail & Related papers (2023-05-15T21:17:17Z)
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing unreliable translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages; a minimal sketch of the code-switching idea appears after this list.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
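As a concrete illustration of the code-switching idea in the last entry above, here is a hedged sketch of dictionary-based augmentation in the spirit of CoSDA-ML; the lexicon format, replacement rate, and function name are assumptions for illustration, not the paper's exact recipe.

```python
import random

def code_switch(tokens, lexicons, rate=0.3, rng=None):
    """Randomly replace tokens with translations drawn from bilingual
    lexicons, yielding a code-switched training sentence.

    tokens: list of source-language tokens.
    lexicons: dict mapping language code -> {src_word: translation}.
    rate: probability of attempting to switch each token.
    """
    rng = rng or random.Random(0)
    switched = []
    for tok in tokens:
        if rng.random() < rate:
            lang = rng.choice(sorted(lexicons))  # pick a target language
            # Fall back to the original token if the lexicon lacks it.
            switched.append(lexicons[lang].get(tok.lower(), tok))
        else:
            switched.append(tok)
    return switched

# Example: one sentence augmented with toy Chinese and German lexicons.
lex = {"zh": {"dialogue": "对话"}, "de": {"dialogue": "Dialog", "data": "Daten"}}
print(code_switch("multiparty dialogue data".split(), lex, rate=0.5))
```

The appeal of this setup, per the summary above, is that only word-level lexicons and a single fine-tuning run are needed to cover multiple target languages, since the original task labels are kept unchanged.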