Large Language Models for Stemming: Promises, Pitfalls and Failures
- URL: http://arxiv.org/abs/2402.11757v1
- Date: Mon, 19 Feb 2024 01:11:44 GMT
- Title: Large Language Models for Stemming: Promises, Pitfalls and Failures
- Authors: Shuai Wang, Shengyao Zhuang, Guido Zuccon
- Abstract summary: We investigate the promising idea of using large language models (LLMs) to stem words by leveraging their capability of context understanding.
We compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text.
- Score: 34.91311006478368
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text stemming is a natural language processing technique that is used to
reduce words to their base form, also known as the root form. The use of
stemming in IR has been shown to often improve the effectiveness of
keyword-matching models such as BM25. However, traditional stemming methods,
focusing solely on individual terms, overlook the richness of contextual
information. Recognizing this gap, in this paper, we investigate the promising
idea of using large language models (LLMs) to stem words by leveraging their
capability of context understanding. In this respect, we identify three
avenues, each characterised by different trade-offs in terms of computational
cost, effectiveness and robustness: (1) use LLMs to stem the vocabulary for a
collection, i.e., the set of unique words that appear in the collection
(vocabulary stemming), (2) use LLMs to stem each document separately
(contextual stemming), and (3) use LLMs to extract from each document entities
that should not be stemmed, then use vocabulary stemming to stem the rest of
the terms (entity-based contextual stemming). Through a series of empirical
experiments, we compare the use of LLMs for stemming with that of traditional
lexical stemmers such as Porter and Krovetz for English text. We find that
while vocabulary stemming and contextual stemming fail to achieve higher
effectiveness than traditional stemmers, entity-based contextual stemming can
achieve higher effectiveness than using the Porter stemmer alone, under specific
conditions.
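To make the three avenues concrete, below is a minimal Python sketch of avenue (1), vocabulary stemming, and avenue (3), entity-based contextual stemming, alongside an NLTK Porter-stemmer baseline. The `call_llm` helper, the prompt wording, and the whitespace tokenisation are illustrative assumptions, not the authors' actual prompts or implementation.

```python
# A minimal, hypothetical sketch of two of the avenues described above:
# (1) vocabulary stemming with an LLM and (3) entity-based contextual stemming,
# contrasted with a Porter-stemmer baseline. The prompt wording and the
# `call_llm` helper are illustrative assumptions, not the paper's actual code.
from typing import Dict, Iterable, List

from nltk.stem import PorterStemmer  # pip install nltk


def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API; returns the model's text reply."""
    raise NotImplementedError("Wire this up to an LLM provider of your choice.")


def build_vocabulary(documents: Iterable[str]) -> List[str]:
    """Collect the unique (lower-cased, whitespace-tokenised) terms in the collection."""
    vocab = set()
    for doc in documents:
        vocab.update(doc.lower().split())
    return sorted(vocab)


def llm_vocabulary_stemming(vocab: List[str], batch_size: int = 50) -> Dict[str, str]:
    """Avenue (1): ask the LLM to map each vocabulary term to its base form, in batches."""
    mapping: Dict[str, str] = {}
    for i in range(0, len(vocab), batch_size):
        batch = vocab[i:i + batch_size]
        prompt = ("Reduce each word below to its base (root) form. "
                  "Answer with one 'word -> stem' pair per line.\n" + "\n".join(batch))
        for line in call_llm(prompt).splitlines():
            if "->" in line:
                word, stem = (part.strip() for part in line.split("->", 1))
                mapping[word] = stem
    return mapping


def porter_vocabulary_stemming(vocab: List[str]) -> Dict[str, str]:
    """Baseline: the same vocabulary-level mapping, produced by the Porter stemmer."""
    stemmer = PorterStemmer()
    return {word: stemmer.stem(word) for word in vocab}


def llm_extract_entities(document: str) -> List[str]:
    """Ask the LLM for entities in a document that should be left unstemmed."""
    prompt = ("List the named entities in the following text, one per line, "
              "with no extra commentary.\n" + document)
    return [line.strip().lower() for line in call_llm(prompt).splitlines() if line.strip()]


def entity_based_contextual_stemming(document: str, mapping: Dict[str, str]) -> str:
    """Avenue (3): stem every term except those the LLM flagged as entities."""
    protected = {tok for entity in llm_extract_entities(document) for tok in entity.split()}
    return " ".join(tok if tok in protected else mapping.get(tok, tok)
                    for tok in document.lower().split())


def apply_stemming(document: str, mapping: Dict[str, str]) -> str:
    """Replace each term with its stem before indexing, e.g. for a BM25 index."""
    return " ".join(mapping.get(tok, tok) for tok in document.lower().split())
```

In this sketch the vocabulary-level mapping is built once from the collection's unique terms and reused for every document, while the entity-based variant issues an additional LLM call per document, reflecting the cost trade-offs the abstract alludes to.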
Related papers
- Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia [27.344665855217567]
Human readers can efficiently comprehend scrambled words, primarily by relying on word form.
While advanced large language models (LLMs) exhibit similar abilities, the underlying mechanisms remain unclear.
arXiv Detail & Related papers (2025-03-03T16:31:45Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- LLM-TAKE: Theme Aware Keyword Extraction Using Large Language Models [10.640773460677542]
We explore using Large Language Models (LLMs) to generate keywords for items, inferred from the items' textual metadata.
Our modeling framework includes several stages to refine the results by avoiding keywords that are non-informative or sensitive.
We propose two variations of the framework for generating extractive and abstractive themes for products in an e-commerce setting.
arXiv Detail & Related papers (2023-12-01T20:13:08Z)
- Unsupervised extraction of local and global keywords from a single text [0.0]
We propose an unsupervised, corpus-independent method to extract keywords from a single text.
It is based on the spatial distribution of words and the response of this distribution to a random permutation of words.
arXiv Detail & Related papers (2023-07-26T07:36:25Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents [19.035917264711664]
We propose a training strategy for text semantic matching by disentangling keywords from intents.
Our approach can be easily combined with pre-trained language models (PLM) without influencing their inference efficiency.
arXiv Detail & Related papers (2022-03-06T07:48:24Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- FRAKE: Fusional Real-time Automatic Keyword Extraction [1.332091725929965]
Keyword extraction refers to identifying the words or phrases that best express the main concepts of a text.
We use a combined approach that consists of two models: graph centrality features and textual features.
arXiv Detail & Related papers (2021-04-10T18:30:17Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons.
Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources.
We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.