Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer
- URL: http://arxiv.org/abs/2209.14008v1
- Date: Wed, 28 Sep 2022 11:31:43 GMT
- Title: Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer
- Authors: Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk
- Abstract summary: The paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish to the task of intrinsic and extrinsic keyword extraction from short text passages.
We compare the results obtained by four different methods, i.e., plT5kw, extremeText, TermoPL, and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The paper explores the relevance of the Text-To-Text Transfer Transformer
language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic
keyword extraction from short text passages. The evaluation is carried out on
the new Polish Open Science Metadata Corpus (POSMAC), which is released with
this paper: a collection of 216,214 abstracts of scientific publications
compiled in the CURLICAT project. We compare the results obtained by four
different methods, i.e., plT5kw, extremeText, TermoPL, and KeyBERT, and conclude that
the plT5kw model yields particularly promising results for both frequent and
sparsely represented keywords. Furthermore, a plT5kw keyword generation model
trained on the POSMAC also seems to produce highly useful results in
cross-domain text labelling scenarios. We discuss the performance of the model
on news stories and phone-based dialog transcripts which represent text genres
and domains extrinsic to the dataset of scientific abstracts. Finally, we also
attempt to characterize the challenges of evaluating a text-to-text model on
both intrinsic and extrinsic keyword extraction.
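As a rough illustration of the text-to-text formulation described above, the following sketch shows how a T5-style Polish checkpoint could be prompted to generate keywords for a short passage with the Hugging Face transformers library. It is a minimal sketch only: the publicly released allegro/plt5-base checkpoint stands in for the paper's plT5kw model (which is fine-tuned on POSMAC), and the decoding settings are illustrative assumptions.

```python
# Minimal sketch of generative keyword extraction in the spirit of plT5kw.
# Assumptions: "allegro/plt5-base" is a general plT5 checkpoint, not the
# authors' keyword-tuned model, so real use would require fine-tuning on
# keyword-annotated data (e.g. POSMAC); decoding settings are illustrative.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "allegro/plt5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

abstract = (
    "Artykuł omawia zastosowanie modeli typu T5 do ekstrakcji słów "
    "kluczowych z krótkich tekstów naukowych."
)

# Cast keyword extraction as text-to-text: passage in, keyword string out.
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
keyword_string = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# A keyword-tuned model is expected to emit a comma-separated keyword list.
print([kw.strip() for kw in keyword_string.split(",") if kw.strip()])
```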
Related papers
- VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models [5.713983191152314]
VTechAGP is the first academic-to-general-audience text paraphrase dataset.
We also propose a novel dynamic soft prompt generative language model, DSPT5.
For training, we leverage a contrastive-generative loss function to learn the keyword in the dynamic prompt.
arXiv Detail & Related papers (2024-11-07T16:06:00Z)
- Enhancing Automatic Keyphrase Labelling with Text-to-Text Transfer Transformer (T5) Architecture: A Framework for Keyphrase Generation and Filtering [2.1656586298989793]
This paper presents a keyphrase generation model based on the Text-to-Text Transfer Transformer (T5) architecture.
We also present a novel keyphrase filtering technique based on the T5 architecture.
arXiv Detail & Related papers (2024-09-25T09:16:46Z)
- SKT5SciSumm -- Revisiting Extractive-Generative Approach for Multi-Document Scientific Summarization [24.051692189473723]
We propose SKT5SciSumm, a hybrid framework for multi-document scientific summarization (MDSS).
We leverage the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed Transformers (SPECTER) to encode and represent textual sentences.
We employ the T5 family of models to generate abstractive summaries using extracted sentences.
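To make this extract-then-generate pipeline concrete, here is a minimal sketch under stated assumptions: sentences are embedded with a SPECTER checkpoint from sentence-transformers, a simple centroid-similarity heuristic stands in for the paper's extractive selection, and a generic t5-small summarizer replaces the fine-tuned T5 family models.

```python
# A minimal sketch of an extract-then-generate pipeline in the spirit of
# SKT5SciSumm. Assumptions: the centroid-based sentence selection is a
# simplified stand-in for the paper's method, and the checkpoints below
# (SPECTER via sentence-transformers, t5-small) are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

sentences = [
    "Transformer models dominate multi-document summarization benchmarks.",
    "SPECTER embeddings capture citation-informed document semantics.",
    "We combine extractive selection with abstractive T5 generation.",
    "Unrelated filler sentence about the weather.",
]

# 1) Extractive step: embed sentences with SPECTER and keep those closest
#    to the centroid of the collection.
encoder = SentenceTransformer("sentence-transformers/allenai-specter")
emb = encoder.encode(sentences, normalize_embeddings=True)
centroid = emb.mean(axis=0)
scores = emb @ centroid
top_k = [sentences[i] for i in np.argsort(-scores)[:3]]

# 2) Abstractive step: summarize the selected sentences with a T5 model.
summarizer = pipeline("summarization", model="t5-small")
summary = summarizer(" ".join(top_k), max_length=60, min_length=10)[0]["summary_text"]
print(summary)
```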
arXiv Detail & Related papers (2024-02-27T08:33:31Z)
- Cross-Domain Robustness of Transformer-based Keyphrase Generation [1.8492669447784602]
A list of keyphrases is an important element of a text in databases and repositories of electronic documents.
In our experiments, abstractive text summarization models fine-tuned for keyphrase generation achieve strong results on the target text corpus.
We present an evaluation of the fine-tuned BART models for the keyphrase selection task across six benchmark corpora.
arXiv Detail & Related papers (2023-12-17T12:27:15Z)
- GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content [27.901155229342375]
We present a novel approach for detecting ChatGPT-generated vs. human-written text using language models.
Our models achieved remarkable results, with an accuracy of over 97% on the test dataset, as evaluated through various metrics.
arXiv Detail & Related papers (2023-05-13T17:12:11Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a widely used industrial streaming model, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
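As a small illustration of the single text-to-text training objective, the sketch below casts a Polish sentiment example into a source/target text pair and computes the standard seq2seq loss with the released plT5 checkpoint. The task prefix and label verbalization are assumptions for illustration, not the benchmark's exact format.

```python
# A minimal sketch of casting a Polish classification example into the
# text-to-text format used to fine-tune plT5. Assumptions: the task prefix
# and label verbalization are illustrative, not the benchmark's exact
# formatting; "allegro/plt5-base" is the publicly released checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("allegro/plt5-base")

# Every task becomes "input text -> output text": here, sentiment as words.
source = "klej sentyment: Obsługa była bardzo miła i profesjonalna."
target = "pozytywny"

batch = tokenizer(source, text_target=target, return_tensors="pt")
loss = model(**batch).loss  # standard seq2seq cross-entropy, one objective for all tasks
print(float(loss))
```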
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work so that they can learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.