EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form
Summarization in the Legal Domain
- URL: http://arxiv.org/abs/2210.13448v1
- Date: Mon, 24 Oct 2022 17:58:59 GMT
- Title: EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form
Summarization in the Legal Domain
- Authors: Dennis Aumiller and Ashish Chouhan and Michael Gertz
- Abstract summary: We propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex)
Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages.
We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages.
- Score: 2.4815579733050157
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing summarization datasets come with two main drawbacks: (1) They tend
to focus on overly exposed domains, such as news articles or wiki-like texts,
and (2) are primarily monolingual, with few multilingual datasets. In this
work, we propose a novel dataset, called EUR-Lex-Sum, based on manually curated
document summaries of legal acts from the European Union law platform
(EUR-Lex). Documents and their respective summaries exist as cross-lingual
paragraph-aligned data in several of the 24 official European languages,
enabling access to various cross-lingual and lower-resourced summarization
setups. We obtain up to 1,500 document/summary pairs per language, including a
subset of 375 cross-lingually aligned legal acts with texts available in all 24
languages. In this work, the data acquisition process is detailed and key
characteristics of the resource are compared to existing summarization
resources. In particular, we illustrate challenging sub-problems and open
questions on the dataset that could help the facilitation of future research in
the direction of domain-specific cross-lingual summarization. Limited by the
extreme length and language diversity of samples, we further conduct
experiments with suitable extractive monolingual and cross-lingual baselines
for future work. Code for the extraction as well as access to our data and
baselines is available online at: https://github.com/achouhan93/eur-lex-sum.
Related papers
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised
Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z) - Automatic Data Retrieval for Cross Lingual Summarization [4.759360739268894]
Cross-lingual summarization involves the summarization of text written in one language to a different one.
In this work, we aim to perform cross-lingual summarization from English to Hindi.
arXiv Detail & Related papers (2023-12-22T09:13:24Z) - $\mu$PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge [72.64847925450368]
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language.
This work presents $mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge.
arXiv Detail & Related papers (2023-05-23T16:25:21Z) - Bridging Cross-Lingual Gaps During Leveraging the Multilingual
Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - MultiEURLEX -- A multi-lingual and multi-label legal document
classification dataset for zero-shot cross-lingual transfer [13.24356999779404]
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents.
The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy.
We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target)
arXiv Detail & Related papers (2021-09-02T12:52:55Z) - XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training,
Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.