Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers
- URL: http://arxiv.org/abs/2405.07886v1
- Date: Mon, 13 May 2024 16:21:33 GMT
- Title: Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers
- Authors: Alena Tsanda, Elena Bruches,
- Abstract summary: The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and testing of existing language models for the task of automatic text summarization.
A feature of the dataset is its multimodal data, which includes texts, tables and figures.
- Score: 0.20482269513546458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paper discusses the creation of a multimodal dataset of Russian-language scientific papers and testing of existing language models for the task of automatic text summarization. A feature of the dataset is its multimodal data, which includes texts, tables and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available on https://github.com/iis-research-team/summarization-dataset.
Related papers
- A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model [15.596156608713347]
In real-world scenarios, news about an international event often involves multiple documents in different languages.
We construct a mixed-language multi-document news summarization dataset (MLMD-news)
This dataset contains four different languages and 10,992 source document cluster and target summary pairs.
arXiv Detail & Related papers (2024-10-13T08:15:33Z) - 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset [90.95948101052073]
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z) - Constructing Multilingual Code Search Dataset Using Neural Machine
Translation [48.32329232202801]
We create a multilingual code search dataset in four natural and four programming languages.
Our results show that the model pre-trained with all natural and programming language data has performed best in most cases.
arXiv Detail & Related papers (2023-06-27T16:42:36Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on English dataset and then applied on summarization datasets of other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z) - Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z) - MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first multilingual FAQ dataset publicly available.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z) - X-FACT: A New Benchmark Dataset for Multilingual Fact Checking [21.2633064526968]
We introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims.
The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers.
arXiv Detail & Related papers (2021-06-17T05:09:54Z) - Dataset for Automatic Summarization of Russian News [0.0]
We present Gazeta, the first dataset for summarization of Russian news.
We demonstrate that the dataset is a valid task for methods of text summarization for Russian.
arXiv Detail & Related papers (2020-06-19T10:44:06Z) - MLSUM: The Multilingual Summarization Corpus [29.943949944682196]
MLSUM is the first large-scale MultiLingual SUMmarization dataset.
It contains 1.5M+ article/summary pairs in five different languages.
arXiv Detail & Related papers (2020-04-30T15:58:34Z) - ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.