Extending Neural Keyword Extraction with TF-IDF tagset matching
- URL: http://arxiv.org/abs/2102.00472v1
- Date: Sun, 31 Jan 2021 15:39:17 GMT
- Title: Extending Neural Keyword Extraction with TF-IDF tagset matching
- Authors: Boshko Koloski and Senja Pollak and Blaž Škrlj and Matej Martinc
- Abstract summary: Keyword extraction is the task of identifying words that best describe a given document; in news portals, such keywords serve to link articles on similar topics.
In this work we develop and evaluate our methods on four novel data sets covering less-represented, morphologically rich languages in the European news media industry.
- Score: 4.014524824655106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Keyword extraction is the task of identifying words (or multi-word
expressions) that best describe a given document; in news portals, such
keywords serve to link articles on similar topics. In this work we develop and
evaluate our methods on four novel data sets covering less-represented,
morphologically rich languages in the European news media industry (Croatian,
Estonian, Latvian and Russian). First, we evaluate two supervised neural
transformer-based methods (TNT-KID and BERT+BiLSTM CRF) and compare them to a
baseline unsupervised TF-IDF approach. Next, we show that by combining the
keywords retrieved by both neural transformer-based methods and extending the
final set of keywords with an unsupervised TF-IDF-based technique, we can
drastically improve the recall of the system, making it suitable for use as a
recommendation system in the media house environment.
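A minimal sketch of the combination strategy described in the abstract: take the union of the keyword sets proposed by the two neural taggers and extend it with each document's top TF-IDF terms. The neural outputs below are hypothetical placeholders, and scikit-learn's TfidfVectorizer stands in for the paper's own TF-IDF implementation.
```python
# Sketch: union of two neural taggers' keywords, extended with top TF-IDF terms.
# The documents and the "neural" outputs are placeholders, not the paper's data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the government announced a new tax reform for small businesses",
    "the football club signed a new striker before the transfer deadline",
]
# Hypothetical outputs of the two supervised taggers (e.g. TNT-KID and
# BERT+BiLSTM CRF) -- stand-ins for real model predictions.
tnt_kid_keywords = [{"tax reform"}, {"football club"}]
bert_crf_keywords = [{"small businesses"}, {"transfer deadline"}]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

def top_tfidf_terms(row, k=3):
    """Return the k highest-weighted TF-IDF terms of one document."""
    dense = row.toarray().ravel()
    return {terms[i] for i in np.argsort(dense)[::-1][:k] if dense[i] > 0}

for i, doc in enumerate(docs):
    combined = tnt_kid_keywords[i] | bert_crf_keywords[i] | top_tfidf_terms(tfidf[i])
    print(f"doc {i}: {sorted(combined)}")
```
Because the final set is a union, every keyword found by any individual method survives, which is why recall can only increase relative to each component.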
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve text into multiple concepts for multilingual semantic matching, liberating the model from its reliance on NER models.
We conduct comprehensive experiments on the English datasets QQP and MRPC and the Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of the tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of the textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
- Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z)
- Word Sense Induction with Knowledge Distillation from BERT [6.88247391730482]
This paper proposes a method to distill multiple word senses from a pre-trained language model (BERT) by using attention over the senses of a word in a context.
Experiments on the contextual word similarity and sense induction tasks show that this method is superior to or competitive with state-of-the-art multi-sense embeddings.
arXiv Detail & Related papers (2023-04-20T21:05:35Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Self-Supervised Detection of Contextual Synonyms in a Multi-Class Setting: Phenotype Annotation Use Case [11.912581294872767]
Contextualised word embeddings are a powerful tool for detecting contextual synonyms.
We propose a self-supervised pre-training approach that is able to detect contextual synonyms of concepts while being trained on data created by shallow matching.
arXiv Detail & Related papers (2021-09-04T21:35:01Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
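The merged-token idea can be approximated with off-the-shelf tooling; below is a minimal sketch that uses gensim's Phrases to merge frequent collocations into single tokens before LDA training. The toy corpus and thresholds are illustrative, not the paper's setup.
```python
# Sketch: merge frequent collocations into single tokens before training LDA.
# The toy corpus and low thresholds are illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.phrases import Phrases

sentences = [
    "new york city raised public transport fares".split(),
    "public transport in new york remains crowded".split(),
    "the city council debated public transport funding".split(),
]

# Learn collocations; thresholds are low only because the corpus is tiny.
phrases = Phrases(sentences, min_count=1, threshold=1)
merged = [phrases[s] for s in sentences]  # e.g. "new_york", "public_transport"

dictionary = Dictionary(merged)
corpus = [dictionary.doc2bow(doc) for doc in merged]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4):
    print(topic_id, words)
```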
- Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR [0.0]
Deep Transformer models have proven to be particularly powerful in language modeling tasks for ASR.
Recent studies showed that a considerable part of the knowledge of neural Language Models (LMs) can be transferred to traditional n-gram models by using data augmentation based on neural text generation.
We show that although data augmentation with Transformer-generated text works well for isolating languages, it causes a vocabulary explosion in a morphologically rich language.
We propose a new method called subword-based neural text augmentation, where we retokenize the generated text into statistically derived subwords.
arXiv Detail & Related papers (2020-07-14T10:22:05Z)
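A rough sketch of the retokenization step, using a SentencePiece BPE model as a stand-in for the statistically derived subword inventory; the file paths and vocabulary size are placeholders.
```python
# Sketch: retokenize Transformer-generated text into statistically derived
# subwords, here via a SentencePiece BPE model as a stand-in. The file
# paths and vocabulary size are placeholders.
import sentencepiece as spm

# Train a subword model on the original (non-generated) training corpus.
spm.SentencePieceTrainer.train(
    input="training_corpus.txt",   # placeholder path
    model_prefix="subword",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="subword.model")

# Retokenize generated sentences so the n-gram LM vocabulary stays bounded:
# a rare agglutinated word form splits into a few reusable subword units.
generated = "a rare generated wordform"
print(sp.encode(generated, out_type=str))
```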
- Transformer Based Language Models for Similar Text Retrieval and Ranking [0.0]
We introduce novel approaches for effectively applying neural transformer models to similar text retrieval and ranking.
By eliminating the bag-of-words-based step, our approach is able to accurately retrieve and rank results even when they have no non-stopwords in common with the query.
arXiv Detail & Related papers (2020-05-10T06:12:53Z)
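A minimal sketch of retrieval without a bag-of-words step: rank candidates purely by cosine similarity of transformer sentence embeddings. The sentence-transformers encoder here is a stand-in, not the paper's own model.
```python
# Sketch: rank candidates by cosine similarity of transformer embeddings,
# with no bag-of-words filtering step. The encoder is a stand-in.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "how do I reset my password"
candidates = [
    "steps to recover account access",   # no non-stopword overlap with query
    "today's weather forecast for Riga",
]

q = model.encode([query])[0]
c = model.encode(candidates)
scores = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
for rank in np.argsort(scores)[::-1]:
    print(f"{scores[rank]:.3f}  {candidates[rank]}")
```
The first candidate shares no non-stopwords with the query, yet a semantic encoder still ranks it above the off-topic one, which is the point the abstract makes.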
- TNT-KID: Transformer-based Neural Tagger for Keyword Identification [7.91883337742071]
We present a novel algorithm for keyword identification called Transformer-based Neural Tagger for Keyword IDentification (TNT-KID).
By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model overcomes deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction.
arXiv Detail & Related papers (2020-03-20T09:55:10Z)
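TNT-KID casts keyword identification as token-level tagging; the sketch below shows how a document with gold keywords can be turned into a per-token training target. The binary labeling scheme is an illustrative assumption, not necessarily the paper's exact scheme.
```python
# Sketch: keyword identification framed as token-level tagging, as in
# TNT-KID. The binary labeling scheme is an illustrative assumption.
def tag_tokens(tokens, keywords):
    """Label each token 1 if it starts or continues a gold keyword, else 0."""
    labels = [0] * len(tokens)
    for kw in keywords:
        kw_tokens = kw.split()
        for i in range(len(tokens) - len(kw_tokens) + 1):
            if tokens[i : i + len(kw_tokens)] == kw_tokens:
                for j in range(i, i + len(kw_tokens)):
                    labels[j] = 1
    return labels

tokens = "the government announced a new tax reform today".split()
print(tag_tokens(tokens, ["tax reform", "government"]))
# -> [0, 1, 0, 0, 0, 1, 1, 0]
```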
This list is automatically generated from the titles and abstracts of the papers on this site.