Large Language Models for Stemming: Promises, Pitfalls and Failures
- URL: http://arxiv.org/abs/2402.11757v1
- Date: Mon, 19 Feb 2024 01:11:44 GMT
- Title: Large Language Models for Stemming: Promises, Pitfalls and Failures
- Authors: Shuai Wang, Shengyao Zhuang, Guido Zuccon
- Abstract summary: We investigate the promising idea of using large language models (LLMs) to stem words by leveraging their capability of context understanding.
We compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text.
- Score: 34.91311006478368
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text stemming is a natural language processing technique that is used to
reduce words to their base form, also known as the root form. The use of
stemming in IR has been shown to often improve the effectiveness of
keyword-matching models such as BM25. However, traditional stemming methods,
focusing solely on individual terms, overlook the richness of contextual
information. Recognizing this gap, in this paper, we investigate the promising
idea of using large language models (LLMs) to stem words by leveraging their
capability of context understanding. In this respect, we identify three
avenues, each characterised by different trade-offs in terms of computational
cost, effectiveness and robustness: (1) use LLMs to stem the vocabulary for a
collection, i.e., the set of unique words that appear in the collection
(vocabulary stemming), (2) use LLMs to stem each document separately
(contextual stemming), and (3) use LLMs to extract from each document entities
that should not be stemmed, then use vocabulary stemming to stem the rest of
the terms (entity-based contextual stemming). Through a series of empirical
experiments, we compare the use of LLMs for stemming with that of traditional
lexical stemmers such as Porter and Krovetz for English text. We find that
while vocabulary stemming and contextual stemming fail to achieve higher
effectiveness than traditional stemmers, entity-based contextual stemming can
achieve higher effectiveness than using the Porter stemmer alone, under specific
conditions.
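To make the three avenues concrete, below is a minimal Python sketch of avenue (1), vocabulary stemming, and avenue (3), entity-based contextual stemming, alongside an NLTK Porter-stemmer baseline. The `call_llm` helper, the prompt wording, and the whitespace tokenisation are illustrative assumptions, not the authors' actual prompts or implementation.

```python
# A minimal, hypothetical sketch of two of the avenues described above:
# (1) vocabulary stemming with an LLM and (3) entity-based contextual stemming,
# contrasted with a Porter-stemmer baseline. The prompt wording and the
# `call_llm` helper are illustrative assumptions, not the paper's actual code.
from typing import Dict, Iterable, List

from nltk.stem import PorterStemmer  # pip install nltk


def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API; returns the model's text reply."""
    raise NotImplementedError("Wire this up to an LLM provider of your choice.")


def build_vocabulary(documents: Iterable[str]) -> List[str]:
    """Collect the unique (lower-cased, whitespace-tokenised) terms in the collection."""
    vocab = set()
    for doc in documents:
        vocab.update(doc.lower().split())
    return sorted(vocab)


def llm_vocabulary_stemming(vocab: List[str], batch_size: int = 50) -> Dict[str, str]:
    """Avenue (1): ask the LLM to map each vocabulary term to its base form, in batches."""
    mapping: Dict[str, str] = {}
    for i in range(0, len(vocab), batch_size):
        batch = vocab[i:i + batch_size]
        prompt = ("Reduce each word below to its base (root) form. "
                  "Answer with one 'word -> stem' pair per line.\n" + "\n".join(batch))
        for line in call_llm(prompt).splitlines():
            if "->" in line:
                word, stem = (part.strip() for part in line.split("->", 1))
                mapping[word] = stem
    return mapping


def porter_vocabulary_stemming(vocab: List[str]) -> Dict[str, str]:
    """Baseline: the same vocabulary-level mapping, produced by the Porter stemmer."""
    stemmer = PorterStemmer()
    return {word: stemmer.stem(word) for word in vocab}


def llm_extract_entities(document: str) -> List[str]:
    """Ask the LLM for entities in a document that should be left unstemmed."""
    prompt = ("List the named entities in the following text, one per line, "
              "with no extra commentary.\n" + document)
    return [line.strip().lower() for line in call_llm(prompt).splitlines() if line.strip()]


def entity_based_contextual_stemming(document: str, mapping: Dict[str, str]) -> str:
    """Avenue (3): stem every term except those the LLM flagged as entities."""
    protected = {tok for entity in llm_extract_entities(document) for tok in entity.split()}
    return " ".join(tok if tok in protected else mapping.get(tok, tok)
                    for tok in document.lower().split())


def apply_stemming(document: str, mapping: Dict[str, str]) -> str:
    """Replace each term with its stem before indexing, e.g. for a BM25 index."""
    return " ".join(mapping.get(tok, tok) for tok in document.lower().split())
```

In this sketch the vocabulary-level mapping is built once from the collection's unique terms and reused for every document, while the entity-based variant issues an additional LLM call per document, reflecting the cost trade-offs the abstract alludes to.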
Related papers
- Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia [27.344665855217567]
Human readers can efficiently comprehend scrambled words, primarily by relying on word form.
While advanced large language models (LLMs) exhibit similar abilities, the underlying mechanisms remain unclear.
arXiv Detail & Related papers (2025-03-03T16:31:45Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- LLM-TAKE: Theme Aware Keyword Extraction Using Large Language Models [10.640773460677542]
We explore using Large Language Models (LLMs) to generate keywords for items, inferred from the items' textual metadata.
Our modeling framework includes several stages to refine the results by avoiding keywords that are non-informative or sensitive.
We propose two variations of the framework for generating extractive and abstractive themes for products in an e-commerce setting.
arXiv Detail & Related papers (2023-12-01T20:13:08Z)
- Unsupervised extraction of local and global keywords from a single text [0.0]
We propose an unsupervised, corpus-independent method to extract keywords from a single text.
It is based on the spatial distribution of words and the response of this distribution to a random permutation of words.
arXiv Detail & Related papers (2023-07-26T07:36:25Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents [19.035917264711664]
We propose a training strategy for text semantic matching by disentangling keywords from intents.
Our approach can be easily combined with pre-trained language models (PLM) without influencing their inference efficiency.
arXiv Detail & Related papers (2022-03-06T07:48:24Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- FRAKE: Fusional Real-time Automatic Keyword Extraction [1.332091725929965]
Keyword extraction refers to identifying the words or phrases that best express the main concepts of a text.
We use a combined approach that consists of two models: graph centrality features and textual features.
arXiv Detail & Related papers (2021-04-10T18:30:17Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons.
Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources.
We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.