HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew
- URL: http://arxiv.org/abs/2406.03897v2
- Date: Mon, 10 Jun 2024 05:45:25 GMT
- Title: HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew
- Authors: Tzuf Paz-Argaman, Itai Mondshine, Asaf Achi Mordechai, Reut Tsarfaty
- Abstract summary: HeSum is a benchmark specifically designed for abstractive text summarization in Modern Hebrew.
HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals.
Linguistic analysis confirms HeSum's high abstractness and unique morphological challenges.
- Score: 12.320161893898735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large language models (LLMs) excel in various natural language tasks in English, their performance in lower-resourced languages like Hebrew, especially for generative tasks such as abstractive summarization, remains unclear. The high morphological richness of Hebrew adds further challenges due to ambiguity in sentence comprehension and complexity in meaning construction. In this paper, we address this resource and evaluation gap by introducing HeSum, a novel benchmark specifically designed for abstractive text summarization in Modern Hebrew. HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites and written by professionals. Linguistic analysis confirms HeSum's high abstractness and unique morphological challenges. We show that HeSum presents distinct difficulties for contemporary state-of-the-art LLMs, establishing it as a valuable testbed for generative language technology in Hebrew, and for the generative challenges of morphologically rich languages (MRLs) in general.
Related papers
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Multilingual Text Representation [3.4447129363520337]
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages.
State-of-the-art language models have come a long way, starting from simple one-hot representations of words.
We discuss how the full potential of language democratization could be obtained, reaching beyond the known limits.
arXiv Detail & Related papers (2023-09-02T14:21:22Z)
- Echoes from Alexandria: A Large Resource for Multilingual Book Summarization [99.86355187131349]
"Echoes from Alexandria" is a large resource for multilingual book summarization.
Echoes features three novel datasets: i) Echo-Wiki, for multilingual book summarization, ii) Echo-XSum, for extremely-compressive multilingual book summarization, and iii) Echo-FairySum, for extractive book summarization.
arXiv Detail & Related papers (2023-06-07T11:01:39Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- ParaShoot: A Hebrew Question Answering Dataset [22.55706811131828]
ParaShoot is the first question-answering dataset in modern Hebrew.
We provide the first baseline results using recently-released BERT-style models for Hebrew.
arXiv Detail & Related papers (2021-09-23T11:59:38Z)
- Generalising Multilingual Concept-to-Text NLG with Language Agnostic Delexicalisation [0.40611352512781856]
Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language.
We propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings.
Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text.
arXiv Detail & Related papers (2021-05-07T17:48:53Z)
- Neural Abstractive Text Summarizer for Telugu Language [0.0]
The proposed architecture is based on encoder-decoder sequence models with an attention mechanism.
We apply this model to a manually created dataset to generate a one-sentence summary of the source text.
arXiv Detail & Related papers (2021-01-18T15:22:50Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
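Two of the entries above describe encoder-decoder models with an attention mechanism, in which each decoder step computes a weighted combination of the encoder states. As a rough illustrative sketch only (not any listed paper's actual implementation, and using scaled dot-product attention as one common variant), the core computation looks like this:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: each decoder query attends
    over all encoder states (keys/values) and returns a context
    vector plus the attention weights."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ values, weights           # context: (n_queries, d_v)

# Toy example: 2 decoder steps attending over 4 encoder states.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))    # decoder hidden states (hypothetical dims)
K = rng.normal(size=(4, 8))    # encoder states used as keys
V = rng.normal(size=(4, 16))   # encoder states used as values
context, w = attention(Q, K, V)
```

All dimensions here are arbitrary placeholders; the papers' own models use learned projections and recurrent (LSTM) encoders and decoders around this attention step.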
This list is automatically generated from the titles and abstracts of the papers in this site.