Phrase Mining
- URL: http://arxiv.org/abs/2206.13748v1
- Date: Tue, 28 Jun 2022 04:11:31 GMT
- Title: Phrase Mining
- Authors: Ellie Small, Javier Cabrera
- Abstract summary: We present a method that eliminates double-counting without the need to identify lists of quality phrases.
In the context of a set of texts, we define a principal phrase as a phrase that, among other conditions, does not cross punctuation marks.
An R package called phm has been developed that implements this method.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting frequent words from a collection of texts is performed on a great
scale in many subjects. Extracting phrases, on the other hand, is not commonly
done due to inherent complications when extracting phrases, the most
significant complication being that of double-counting, where words or phrases
are counted when they appear inside longer phrases that themselves are also
counted. Several papers have been written on phrase mining that describe
solutions to this issue; however, they either require a list of so-called
quality phrases to be available to the extracting process, or they require
human interaction to identify those quality phrases during the process. We
present a method that eliminates double-counting without the need to identify
lists of quality phrases. In the context of a set of texts, we define a
principal phrase as a phrase that does not cross punctuation marks, does not
start with a stop word, with the exception of the stop words "not" and "no",
does not end with a stop word, is frequent within those texts without being
double counted, and is meaningful to the user. Our method can identify such
principal phrases independently without human input, and enables their
extraction from any texts. An R package called phm has been developed that
implements this method.
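The conditions in the abstract can be illustrated with a small sketch. This is not the authors' algorithm and not the phm package's API; it is a simplified greedy, longest-first pass written in Python under stated assumptions. The stop-word list, the function names `is_valid` and `mine_phrases`, and the parameters `max_len` and `min_freq` are all hypothetical. It applies the boundary rules (no crossing of punctuation, no leading stop word other than "not" and "no", no trailing stop word) and avoids double-counting by letting a frequent longer phrase claim its words so that shorter phrases are not counted at those positions. The "meaningful to the user" condition is not modelled.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "to", "not", "no"}
# Stop words that the abstract allows at the start of a phrase.
ALLOWED_START = {"not", "no"}


def is_valid(gram):
    """Check the boundary rules: first and last word of the phrase."""
    if gram[0] in STOP_WORDS and gram[0] not in ALLOWED_START:
        return False
    if gram[-1] in STOP_WORDS:
        return False
    return True


def mine_phrases(texts, max_len=3, min_freq=2):
    """Greedy longest-first phrase counting (a sketch, not the paper's method)."""
    # Split on punctuation so that no phrase can cross a punctuation mark.
    segments = []
    for text in texts:
        for seg in re.split(r"[^\w\s]+", text.lower()):
            words = seg.split()
            if words:
                segments.append(words)

    # used[k][i] is True once word i of segment k belongs to a retained phrase.
    used = [[False] * len(words) for words in segments]
    result = Counter()

    for n in range(max_len, 0, -1):
        # Count valid n-grams only over words not yet claimed by a longer phrase.
        counts = Counter()
        for words, mask in zip(segments, used):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                if not any(mask[i:i + n]) and is_valid(gram):
                    counts[gram] += 1
        frequent = {g for g, c in counts.items() if c >= min_freq}

        # Claim occurrences left to right; the reported count is the number
        # of claimed, non-overlapping occurrences.
        for words, mask in zip(segments, used):
            i = 0
            while i <= len(words) - n:
                gram = tuple(words[i:i + n])
                if gram in frequent and not any(mask[i:i + n]):
                    result[" ".join(gram)] += 1
                    for j in range(i, i + n):
                        mask[j] = True
                    i += n
                else:
                    i += 1
    return dict(result)


sample_texts = [
    "Machine learning is fun. Machine learning is hard.",
    "I like machine learning, not deep learning.",
]
phrases = mine_phrases(sample_texts)
print(phrases)
```

In this sample, a plain word count would report "learning" four times and "machine" three times. Here, the three occurrences inside "machine learning" are attributed to the longer phrase only, and the one remaining "learning" falls below the frequency threshold.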
Related papers
- Levée d'ambiguïtés par grammaires locales [0.0]
This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein's INTEX system (1993). We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately.
arXiv Detail & Related papers (2025-10-28T15:38:22Z)
- Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning [54.25737182568224]
DiaASQ aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. We introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level.
arXiv Detail & Related papers (2025-08-07T04:22:17Z)
- Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition [56.972851337263755]
We propose a method which allows corrections of substitution errors to improve the recognition accuracy of challenging words. We show that with this method we get a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
arXiv Detail & Related papers (2025-06-23T14:42:03Z)
- N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets [1.9908600514057855]
We present a two-step keyword boosting mechanism that works on normalized unigrams and n-grams rather than just single tokens.
This improves our keyword recognition rate by 26% relative on our proprietary in-domain dataset and 2% on LibriSpeech.
arXiv Detail & Related papers (2023-08-04T00:23:14Z)
- Unsupervised extraction of local and global keywords from a single text [0.0]
We propose an unsupervised, corpus-independent method to extract keywords from a single text.
It is based on the spatial distribution of words and the response of this distribution to a random permutation of words.
arXiv Detail & Related papers (2023-07-26T07:36:25Z)
- Conjunct Resolution in the Face of Verbal Omissions [51.220650412095665]
We propose a conjunct resolution task that operates directly on the text and makes use of a split-and-rephrase paradigm in order to recover the missing elements in the coordination structure.
We curate a large dataset, containing over 10K examples of naturally-occurring verbal omissions with crowd-sourced annotations.
We train various neural baselines for this task, and show that while our best method obtains decent performance, it leaves ample space for improvement.
arXiv Detail & Related papers (2023-05-26T08:44:02Z)
- Sentence Identification with BOS and EOS Label Combinations [7.053475270377054]
We formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text.
We propose a simple yet effective method which combines the beginning of the sentence (BOS) and EOS labels to determine the most probable SUs and NSUs.
Our experiments on the sentence identification task demonstrate that our proposed method generally outperforms sentence segmentation baselines which only utilize EOS labels.
arXiv Detail & Related papers (2023-01-31T01:03:07Z)
- Applying Transformer-based Text Summarization for Keyphrase Generation [2.28438857884398]
Keyphrases are crucial for searching and systematizing scholarly documents.
In this paper, we experiment with popular transformer-based models for abstractive text summarization.
We show that summarization models are quite effective in generating keyphrases in terms of the full-match F1-score and BERTScore.
We also investigate several ordering strategies to target keyphrases.
arXiv Detail & Related papers (2022-09-08T13:01:52Z)
- Hierarchical Context Tagging for Utterance Rewriting [51.251400047377324]
Methods that tag rather than linearly generate sequences have proven stronger in both in- and out-of-domain rewriting settings.
We propose a hierarchical context tagger that mitigates this issue by predicting slotted rules.
Experiments on several benchmarks show that HCT can outperform state-of-the-art rewriting systems by 2 BLEU points.
arXiv Detail & Related papers (2022-06-22T17:09:34Z)
- Phrase Retrieval Learns Passage Retrieval, Too [77.57208968326422]
We study whether phrase retrieval can serve as the basis for coarse-level retrieval including passages and documents.
We show that a dense phrase-retrieval system, without any retraining, already achieves better passage retrieval accuracy.
We also show that phrase filtering and vector quantization can reduce the size of our index by 4-10x.
arXiv Detail & Related papers (2021-09-16T17:42:45Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching [66.71886789848472]
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem.
The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information.
Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
arXiv Detail & Related papers (2021-01-16T10:34:03Z)
- Generating Adversarial Examples in Chinese Texts Using Sentence-Pieces [60.58900627906269]
We propose a pre-trained language model as the substitute generator, using sentence-pieces to craft adversarial examples in Chinese.
The substitutions in the generated adversarial examples are not characters or words but 'pieces', which are more natural to Chinese readers.
arXiv Detail & Related papers (2020-12-29T14:28:07Z)