ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization
- URL: http://arxiv.org/abs/2202.05599v1
- Date: Fri, 11 Feb 2022 13:32:14 GMT
- Title: ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization
- Authors: Jiaan Wang, Fandong Meng, Ziyao Lu, Duo Zheng, Zhixu Li, Jianfeng Qu, Jie Zhou
- Abstract summary: We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents.
It consists of 67k+ dialogue documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated summaries in different target languages.
- Score: 41.68574396739112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ClidSum, a benchmark dataset for building cross-lingual
summarization systems on dialogue documents. It consists of 67k+ dialogue
documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated
summaries in different target languages. Based on the proposed ClidSum, we
introduce two benchmark settings for supervised and semi-supervised scenarios,
respectively. We then build various baseline systems in different paradigms
(pipeline and end-to-end) and conduct extensive experiments on ClidSum to
provide deeper analyses. Furthermore, we propose mDialBART which extends
mBART-50 (a multi-lingual BART) via further pre-training. The multiple
objectives used in the further pre-training stage help the pre-trained model
capture the structural characteristics as well as important content in
dialogues and the transformation from source to the target language.
Experimental results show that mDialBART, as an end-to-end model,
outperforms strong pipeline models on ClidSum. Finally, we discuss specific
challenges that current approaches face on this task and give multiple
promising directions for future research. We have released the dataset and code
at https://github.com/krystalan/ClidSum.
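The two paradigms named in the abstract (pipeline and end-to-end) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released code: the function names are hypothetical, the two pipeline orderings shown are the common variants and are assumed here rather than taken from the abstract, and the "models" are toy stand-ins (a first-line summarizer and a two-word glossary translator) so that the example runs without any downloads.

```python
from typing import Callable

# Hypothetical type aliases for illustration only.
Summarizer = Callable[[str], str]  # text -> summary in the same language
Translator = Callable[[str], str]  # source-language text -> target-language text


def summarize_then_translate(dialogue: str, summarize: Summarizer,
                             translate: Translator) -> str:
    """Pipeline variant: summarize in the source language, then translate."""
    return translate(summarize(dialogue))


def translate_then_summarize(dialogue: str, translate: Translator,
                             summarize_target: Summarizer) -> str:
    """Pipeline variant: translate the dialogue, then summarize the translation."""
    return summarize_target(translate(dialogue))


def end_to_end(dialogue: str, xls_model: Callable[[str], str]) -> str:
    """End-to-end: one model maps the source dialogue to a target-language summary."""
    return xls_model(dialogue)


# Toy stand-ins (assumptions, not real models).
GLOSSARY = {"hello": "hallo", "world": "welt"}


def toy_summarize(text: str) -> str:
    # Pretend the first line is the summary.
    return text.splitlines()[0]


def toy_translate(text: str) -> str:
    # Word-by-word glossary lookup, applied line by line.
    return "\n".join(
        " ".join(GLOSSARY.get(word, word) for word in line.split())
        for line in text.splitlines()
    )


dialogue = "hello world\nhello again"
print(summarize_then_translate(dialogue, toy_summarize, toy_translate))
print(translate_then_summarize(dialogue, toy_translate, toy_summarize))
print(end_to_end(dialogue, lambda d: toy_translate(toy_summarize(d))))
```

In a pipeline, errors from the first stage are passed on to the second, whereas an end-to-end model is trained directly on dialogue and target-language summary pairs; the sketch only shows how the components compose, not how well either approach performs.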
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- MLS-Track: Multilevel Semantic Interaction in RMOT [31.153018571396206]
We propose a high-quality yet low-cost data generation method based on Unreal Engine 5.
We construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos.
We also propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer.
arXiv Detail & Related papers (2024-04-18T09:31:03Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs [27.574815708395203]
CrossSum is a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs.
We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset.
We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language.
arXiv Detail & Related papers (2021-12-16T11:40:36Z)
- A Survey of Recent Abstract Summarization Techniques [0.0]
We investigate the impact of pre-training models on several Wikipedia datasets in the English and Indonesian languages.
The most significant factors that influence ROUGE performance are coverage, density, and compression.
The T5-Large, the Pegasus-XSum, and the ProphetNet-CNNDM provide the best summarization.
arXiv Detail & Related papers (2021-04-15T20:01:34Z)