LR-Sum: Summarization for Less-Resourced Languages
- URL: http://arxiv.org/abs/2212.09674v2
- Date: Thu, 26 Oct 2023 19:50:29 GMT
- Title: LR-Sum: Summarization for Less-Resourced Languages
- Authors: Chester Palen-Michel and Constantine Lignos
- Abstract summary: This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset.
LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced.
The source data is public domain newswire collected from from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0)
- Score: 12.605915166622818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This preprint describes work in progress on LR-Sum, a new
permissively-licensed dataset created with the goal of enabling further
research in automatic summarization for less-resourced languages. LR-Sum
contains human-written summaries for 40 languages, many of which are
less-resourced. We describe our process for extracting and filtering the
dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The
source data is public domain newswire collected from from Voice of America
websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0),
making it one of the most openly-licensed multilingual summarization datasets.
We describe how we plan to use the data for modeling experiments and discuss
limitations of the dataset.
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization ( CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model [15.596156608713347]
In real-world scenarios, news about an international event often involves multiple documents in different languages.
We construct a mixed-language multi-document news summarization dataset (MLMD-news)
This dataset contains four different languages and 10,992 source document cluster and target summary pairs.
arXiv Detail & Related papers (2024-10-13T08:15:33Z) - IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning
Datasets for Indian Languages [37.79850860981589]
This work introduces an expansive suite of resources specifically designed for the development of Indic LLMs.
Our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data.
For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models.
arXiv Detail & Related papers (2024-03-11T00:46:56Z) - UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised
Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z) - GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on English dataset and then applied on summarization datasets of other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z) - MTVR: Multilingual Moment Retrieval in Videos [89.24431389933703]
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips.
The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles.
We propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages.
arXiv Detail & Related papers (2021-07-30T20:01:03Z) - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44
Languages [7.8288425529553916]
We present XL-Sum, a comprehensive and diverse dataset of 1 million professionally annotated article-summary pairs from BBC.
The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available.
XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
arXiv Detail & Related papers (2021-06-25T18:00:24Z) - The Tatoeba Translation Challenge -- Realistic Data Sets for Low
Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.