Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian
- URL: http://arxiv.org/abs/2409.10640v2
- Date: Wed, 18 Sep 2024 07:35:46 GMT
- Title: Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian
- Authors: Anna Glazkova, Dmitry Morozov
- Abstract summary: We explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts.
Experiments were conducted on the texts of Russian scientific abstracts from four domains: mathematics & computer science, history, medicine, and linguistics.
The use of generative models, in particular mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Keyphrase selection plays a pivotal role within the domain of scholarly texts, facilitating efficient information retrieval, summarization, and indexing. In this work, we explored how to apply fine-tuned generative transformer-based models to the specific task of keyphrase selection within Russian scientific texts. We experimented with four generative models (ruT5, ruGPT, mT5, and mBART) and evaluated their performance in both in-domain and cross-domain settings. The experiments were conducted on Russian scientific abstracts from four domains: mathematics & computer science, history, medicine, and linguistics. The use of generative models, in particular mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language. Although the cross-domain results were significantly lower, they still surpassed the baselines in several cases, underscoring the promising potential for further exploration and refinement in this research field.
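The listing contains no training code, so the sketch below is only a minimal, hypothetical illustration of the general recipe the abstract describes: fine-tuning an mBART checkpoint to generate keyphrases from a Russian abstract as an ordinary sequence-to-sequence task, via the Hugging Face transformers API. The checkpoint name, the comma separator for the target keyphrase list, and the hyperparameters are assumptions, not the authors' reported setup.

```python
# Hypothetical sketch: fine-tune mBART to generate comma-separated keyphrases
# from a Russian abstract. Checkpoint, separator, and hyperparameters are
# illustrative assumptions, not the configuration reported in the paper.
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50"  # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="ru_RU", tgt_lang="ru_RU")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Toy training pair: abstract -> "keyphrase1, keyphrase2, ..."
abstract = "В статье рассматривается выбор ключевых слов для научных текстов."
keyphrases = "ключевые слова, научные тексты, извлечение информации"

inputs = tokenizer(abstract, max_length=512, truncation=True,
                   return_tensors="pt")
labels = tokenizer(text_target=keyphrases, max_length=64, truncation=True,
                   return_tensors="pt").input_ids

# One standard seq2seq training step (cross-entropy over the target tokens).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# Inference: generate a keyphrase list and split it on the assumed separator.
model.eval()
generated = model.generate(
    **inputs, max_length=64,
    forced_bos_token_id=tokenizer.lang_code_to_id["ru_RU"])
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(decoded.split(", "))
```

Under this framing, the metrics the abstract reports (F1 over keyphrase matches, ROUGE-1, BERTScore) can be computed by comparing the decoded list against the gold keyphrases.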
Related papers
- MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs).
We show that domain shift degrades the performance of scientific NLI models, which demonstrates the diverse characteristics of the domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z)
- Cross-Domain Robustness of Transformer-based Keyphrase Generation [1.8492669447784602]
A list of keyphrases is an important element of a text in databases and repositories of electronic documents.
In our experiments, abstractive text summarization models fine-tuned for keyphrase generation achieve quite high results on a target text corpus.
We present an evaluation of the fine-tuned BART models for the keyphrase selection task across six benchmark corpora.
arXiv Detail & Related papers (2023-12-17T12:27:15Z)
- Automatic Aspect Extraction from Scientific Texts [0.9208007322096533]
We present a cross-domain dataset of scientific texts in Russian, annotated with such aspects as Task, Contribution, Method, and Conclusion.
We show that there are some differences in aspect representation across domains; although our model was trained on a limited number of scientific domains, it is still able to generalize to new ones.
arXiv Detail & Related papers (2023-10-06T07:59:54Z)
- NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts [51.64770549988806]
We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains.
To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination.
We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
arXiv Detail & Related papers (2023-05-25T13:05:52Z)
- GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content [27.901155229342375]
We present a novel approach for detecting ChatGPT-generated vs. human-written text using language models.
Our models achieved remarkable results, with an accuracy of over 97% on the test dataset, as evaluated through various metrics.
arXiv Detail & Related papers (2023-05-13T17:12:11Z)
- ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task [0.0]
Transformer-based language models, including ChatGPT, have demonstrated exceptional performance in various natural language generation tasks.
This study compares ChatGPT's keyphrase generation performance with state-of-the-art models, while also testing its potential as a solution for two significant challenges in the field.
arXiv Detail & Related papers (2023-04-27T13:25:43Z)
- Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study [76.52997424694767]
We present an in-depth empirical study of keyphrase extraction and keyphrase generation using pre-trained language models.
We show that PLMs have competitive high-resource performance and state-of-the-art low-resource performance.
Further results show that in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models.
arXiv Detail & Related papers (2022-12-20T13:20:21Z)
- Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
- Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer [0.0]
The paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish to the task of intrinsic and extrinsic keyword extraction from short text passages.
We compare the results obtained by four different methods, i.e., plT5kw, extremeText, TermoPL, and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords.
arXiv Detail & Related papers (2022-09-28T11:31:43Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)