Fugu-MT Paper Translation (Abstract): Large Language Models for Stemming: Promises, Pitfalls and Failures

Paper overview: Large Language Models for Stemming: Promises, Pitfalls and Failures

arxiv url: http://arxiv.org/abs/2402.11757v1
Date: Mon, 19 Feb 2024 01:11:44 GMT
Status: translation complete
Last updated in system: 2024-02-20 18:52:53.146933
Title: Large Language Models for Stemming: Promises, Pitfalls and Failures
Authors: Shuai Wang, Shengyao Zhuang, Guido Zuccon
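Abstract summary: This study investigates the promising idea of using large language models (LLMs) to stem words by leveraging their capability of context understanding. We compare using LLMs as stemmers with traditional lexical stemmers such as Porter and Krovetz on English text.
Reference score (independently computed attention score): 34.91311006478368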
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Text stemming is a natural language processing technique that is used to reduce words to their base form, also known as the root form. The use of stemming in IR has been shown to often improve the effectiveness of keyword-matching models such as BM25. However, traditional stemming methods, focusing solely on individual terms, overlook the richness of contextual information. Recognizing this gap, in this paper, we investigate the promising idea of using large language models (LLMs) to stem words by leveraging their capability of context understanding. In this respect, we identify three avenues, each characterised by different trade-offs in terms of computational cost, effectiveness and robustness: (1) use LLMs to stem the vocabulary for a collection, i.e., the set of unique words that appear in the collection (vocabulary stemming), (2) use LLMs to stem each document separately (contextual stemming), and (3) use LLMs to extract from each document entities that should not be stemmed, then use vocabulary stemming to stem the rest of the terms (entity-based contextual stemming). Through a series of empirical experiments, we compare the use of LLMs for stemming with that of traditional lexical stemmers such as Porter and Krovetz for English text. We find that while vocabulary stemming and contextual stemming fail to achieve higher effectiveness than traditional stemmers, entity-based contextual stemming can achieve a higher effectiveness than using the Porter stemmer alone, under specific conditions.
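
The three avenues named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names, and the `LLM` callable interface are all assumptions, and `toy_llm` is a crude offline stand-in (suffix stripping plus "capitalised words are entities") so that the sketch runs without a real model.

```python
import re
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a text reply.
LLM = Callable[[str], str]


def tokenize(text: str) -> List[str]:
    """Split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def vocabulary_stemming(docs: List[str], llm: LLM) -> Dict[str, str]:
    """(1) Stem each unique word of the collection once, without context."""
    vocab = sorted({t.lower() for d in docs for t in tokenize(d)})
    return {w: llm(f"Stem the word: {w}").strip() for w in vocab}


def contextual_stemming(doc: str, llm: LLM) -> List[str]:
    """(2) Ask the model to stem a whole document, so it can use context."""
    return llm(f"Stem every word in this text: {doc}").split()


def entity_based_stemming(doc: str, stem_map: Dict[str, str], llm: LLM) -> List[str]:
    """(3) Leave LLM-extracted entities unstemmed; stem the rest via stem_map."""
    reply = llm(f"List the entities in this text, one per line: {doc}")
    protected = {t.lower() for line in reply.splitlines() for t in tokenize(line)}
    out = []
    for tok in tokenize(doc):
        low = tok.lower()
        out.append(low if low in protected else stem_map.get(low, low))
    return out


def toy_llm(prompt: str) -> str:
    """Offline stand-in for a real model (assumption, for illustration only)."""
    def crude(word: str) -> str:
        w = word.lower()
        for suffix in ("ing", "s"):
            if w.endswith(suffix) and len(w) - len(suffix) >= 3:
                return w[: -len(suffix)]
        return w

    head, _, body = prompt.partition(": ")
    if head.startswith("List the entities"):
        # Pretend every capitalised token is part of an entity.
        return "\n".join(t for t in tokenize(body) if t[0].isupper())
    return " ".join(crude(t) for t in tokenize(body))


if __name__ == "__main__":
    docs = ["the Rolling Stones keep touring"]
    stem_map = vocabulary_stemming(docs, toy_llm)
    print(stem_map)
    print(contextual_stemming(docs[0], toy_llm))
    print(entity_based_stemming(docs[0], stem_map, toy_llm))
```

With the toy model, vocabulary stemming reduces "rolling" and "stones" to "roll" and "stone", whereas the entity-based variant keeps "rolling stones" intact and still stems "touring". This illustrates why protecting entities might help a keyword-matching model such as BM25. The cost trade-off is also visible: avenue (1) needs one call per unique word, while avenues (2) and (3) need one call per document.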

Related papers

Improving Contextual ASR via Multi-grained Fusion with Large Language Models [12.755830619473368]
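This paper proposes a novel multi-grained fusion approach that combines the strengths of token-level and phrase-level fusion with Large Language Models (LLMs). The approach adopts a late-fusion strategy that combines the acoustic information of ASR with the rich contextual knowledge of LLMs, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets show state-of-the-art performance on keyword-related metrics.
Paper reference translation (metadata) (2025-07-16T13:59:32Z)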
Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models [22.297388572921477]
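This paper proposes a new BLI task that extracts domain-specific bilingual lexicons using monolingual corpora from a general domain and a target domain. Inspired by the capabilities of pretrained models, we propose a method that improves the word embeddings built in recent BLI work. Experimental results show that the method improves on robust BLI baselines in three domains by an average of 0.78 points.
Paper reference translation (metadata) (2025-05-29T06:37:02Z)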
Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia [27.344665855217567]
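Human readers can efficiently comprehend scrambled words, relying mainly on word form. Advanced large language models (LLMs) show a similar ability, but the underlying mechanism remains unclear.
Paper reference translation (metadata) (2025-03-03T16:31:45Z)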
A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
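We propose decomposing text into multiple concepts for multilingual semantic matching, freeing the model from its dependence on NER models. We conduct comprehensive experiments on the English datasets QQP and MRPC and the Chinese dataset Medical-SM.
Paper reference translation (metadata) (2024-03-05T13:55:16Z)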
LLM-TAKE: Theme Aware Keyword Extraction Using Large Language Models [10.640773460677542]
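We explore using Large Language Models (LLMs) to generate keywords for items, inferred from the items' textual metadata. Our modeling framework includes several stages that refine the results by avoiding uninformative or sensitive keywords. We propose two frameworks for generating extractive and abstractive themes for products in an e-commerce setting.
Paper reference translation (metadata) (2023-12-01T20:13:08Z)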
Unsupervised extraction of local and global keywords from a single text [0.0]
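We propose an unsupervised, corpus-independent method for extracting keywords from a text. It is based on the spatial distribution of words and on the response of this distribution to a random permutation of the words.
Paper reference translation (metadata) (2023-07-26T07:36:25Z)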
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
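We systematically study decompounding, the task of splitting compound words into their constituent words. We introduce a dataset of 255k compound and non-compound words across 56 languages obtained from Wiktionary. We also introduce a novel method for training dedicated models for segmentation.
Paper reference translation (metadata) (2023-05-23T16:32:27Z)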
Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
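This paper presents a large-scale comparative study of lexical substitution methods using both older and recent language models. We show that the already competitive results achieved by SOTA LMs/MLMs can be improved substantially further if information about the target word is injected properly.
Paper reference translation (metadata) (2022-06-07T16:16:19Z)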
Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents [19.035917264711664]
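This paper proposes a training strategy for text semantic matching that disentangles keywords from intents. The proposed approach can easily be combined with pre-trained language models (PLMs) without affecting inference efficiency.
Paper reference translation (metadata) (2022-03-06T07:48:24Z)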
More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
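We propose a new metric for measuring the clustering quality of models in different settings. We show that topics trained with merged tokens yield topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
Paper reference translation (metadata) (2021-08-24T14:08:19Z)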
FRAKE: Fusional Real-time Automatic Keyword Extraction [1.332091725929965]
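Keyword extraction identifies the words or phrases that best represent the main concepts of a text. The approach combines two models: graph centrality features and textual features.
Paper reference translation (metadata) (2021-04-10T18:30:17Z)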
Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
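We propose a self-supervised approach to modeling lexical semantic change. We show that the method can be used to detect semantic change with any alignment method. We illustrate its usefulness with experimental results on three different datasets.
Paper reference translation (metadata) (2021-01-30T18:59:43Z)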
Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
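Language-independent tokenisation (LIT) methods require no labelled language resources or lexicons. Language-specific tokenisation (LST) methods have a long, established history and are developed using carefully created lexicons and training resources. We empirically compare the two approaches, using semantic similarity measurement as an evaluation task across a diverse set of languages.
Paper reference translation (metadata) (2020-02-25T16:24:42Z)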

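The related papers list is generated automatically from the titles and abstracts of papers on this site.

This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use.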