Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization
- URL: http://arxiv.org/abs/2204.13512v2
- Date: Fri, 29 Apr 2022 08:01:26 GMT
- Title: Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization
- Authors: Ruipeng Jia, Xingxing Zhang, Yanan Cao, Shi Wang, Zheng Lin, Furu Wei
- Abstract summary: In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLSSum (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on the MLSUM and WikiLingua datasets and achieve state-of-the-art results under both human and automatic evaluations.
- Score: 80.94424037751243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In zero-shot multilingual extractive text summarization, a model is
typically trained on an English summarization dataset and then applied to
summarization datasets in other languages. Given English gold summaries and
documents, sentence-level labels for extractive summarization are usually
generated using heuristics. However, these monolingual labels created on
English data may not be optimal for other languages, owing to syntactic and
semantic discrepancies between languages. It is therefore possible to
translate the English dataset into other languages and obtain different sets
of labels, again using heuristics. To fully leverage the
information of these different sets of labels, we propose NLSSum (Neural Label
Search for Summarization), which jointly learns hierarchical weights for these
different sets of labels together with our summarization model. We conduct
multilingual zero-shot summarization experiments on MLSUM and WikiLingua
datasets, and we achieve state-of-the-art results using both human and
automatic evaluations across these two datasets.
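The abstract describes two concrete ingredients: heuristic sentence labeling against gold summaries, and learned weights that combine the label sets obtained from the original English data and its translations. The Python sketch below illustrates both under stated assumptions: the unigram-overlap score standing in for ROUGE, the greedy oracle in `greedy_labels`, and the flat softmax weighting in `WeightedLabelLoss` (a simplification of the paper's hierarchical weights) are all illustrative, not NLSSum's exact formulation.

```python
import torch
import torch.nn as nn

def unigram_f1(candidate: str, reference: str) -> float:
    """Crude unigram-overlap F1, standing in for ROUGE when labeling."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    overlap = len(set(c) & set(r))
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0

def greedy_labels(sentences, gold_summary, max_sents=3):
    """Heuristic oracle: greedily pick sentences that raise overlap with
    the gold summary, and mark them with label 1 (extractive targets)."""
    labels = [0] * len(sentences)
    chosen = []
    for _ in range(max_sents):
        best_gain, best_i = 0.0, None
        base = unigram_f1(" ".join(chosen), gold_summary)
        for i, s in enumerate(sentences):
            if labels[i]:
                continue
            gain = unigram_f1(" ".join(chosen + [s]), gold_summary) - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            break  # no sentence improves overlap any further
        labels[best_i] = 1
        chosen.append(sentences[best_i])
    return labels

class WeightedLabelLoss(nn.Module):
    """Learnable weights over K label sets (e.g. labels computed on the
    original English data and on each of its translations); the
    summarizer's per-sentence scores are trained against the weighted
    soft target."""
    def __init__(self, num_label_sets: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_label_sets))
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, sent_scores, label_sets):
        # sent_scores: (num_sentences,) raw scores from the summarizer.
        # label_sets: (K, num_sentences) binary targets from K heuristics.
        w = torch.softmax(self.logits, dim=0)               # (K,)
        soft_target = (w.unsqueeze(1) * label_sets).sum(0)  # (num_sentences,)
        return self.bce(sent_scores, soft_target)
```

In training, `sent_scores` would come from a sentence scorer over a multilingual pretrained encoder, and `label_sets` would stack the heuristic labels computed on the English source and on each of its machine translations.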
Related papers
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created, human-annotated dataset consisting of coherent summaries for five publicly available datasets, together with natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z)
- Universal Cross-Lingual Text Classification [0.3958317527488535]
This research proposes a novel perspective on Universal Cross-Lingual Text Classification.
Our approach involves blending supervised data from different languages during training to create a universal model.
The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages.
arXiv Detail & Related papers (2024-06-16T17:58:29Z)
- Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language [7.59001382786429]
This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German.
Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset.
We train several state-of-the-art machine learning models on the automatically created dataset and release them as well.
arXiv Detail & Related papers (2024-03-25T19:40:26Z)
- Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models [9.808214545408541]
LinguisticLens is a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of text datasets.
It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples.
arXiv Detail & Related papers (2023-05-19T00:53:45Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate an existing English summarization dataset, SummEval, into four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
arXiv Detail & Related papers (2021-09-16T17:35:01Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations (a generic sketch of this min-max objective appears after this list).
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
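The "Leveraging Adversarial Training in Self-Learning" entry above describes minimizing the maximal loss under label-preserving input perturbations. Below is a generic sketch of such a min-max objective, assuming an FGSM-style single-step inner maximization on input embeddings; the function name `adversarial_training_loss`, the `eps` radius, and the clean-plus-adversarial outer objective are illustrative choices, not that paper's exact procedure.

```python
import torch
import torch.nn as nn

def adversarial_training_loss(model: nn.Module, embeds: torch.Tensor,
                              labels: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    """Min-max objective: an FGSM-style inner step finds a small,
    label-preserving embedding perturbation that increases the loss;
    the outer objective trains the model on clean + perturbed inputs.
    `model` is assumed to map embeddings directly to class logits."""
    loss_fn = nn.CrossEntropyLoss()
    embeds = embeds.detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeds), labels)
    # Inner maximization: one signed-gradient ascent step on the inputs.
    grad, = torch.autograd.grad(clean_loss, embeds, retain_graph=True)
    delta = eps * grad.sign()  # small perturbation, label-preserving by assumption
    adv_loss = loss_fn(model(embeds + delta), labels)
    return clean_loss + adv_loss  # outer minimization target
```

A training loop would call `adversarial_training_loss(classifier, embedder(batch), labels).backward()` and then step the optimizer as usual.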
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.