AHAM: Adapt, Help, Ask, Model -- Harvesting LLMs for literature mining
- URL: http://arxiv.org/abs/2312.15784v1
- Date: Mon, 25 Dec 2023 18:23:03 GMT
- Title: AHAM: Adapt, Help, Ask, Model -- Harvesting LLMs for literature mining
- Authors: Boshko Koloski, Nada Lavrač, Bojan Cestnik, Senja Pollak, Blaž Škrlj, and Andrej Kastrin
- Abstract summary: We present the AHAM methodology and a metric that guides the domain-specific adaptation of the BERTopic topic modeling framework.
By utilizing the LLaMa2 generative language model, we generate topic definitions via one-shot learning.
For inter-topic similarity evaluation, we leverage metrics from language generation and translation processes.
- Score: 3.8384235322772864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In an era marked by a rapid increase in scientific publications, researchers grapple with the challenge of keeping pace with field-specific advances. We present the AHAM methodology and a metric that guides the domain-specific Adaptation of the BERTopic topic modeling framework to improve scientific text analysis. By utilizing the LLaMa2 generative language model, we generate topic definitions via one-shot learning, crafting prompts with the Help of domain experts to guide the LLM for literature mining by Asking it to Model the topic names. For inter-topic similarity evaluation, we leverage metrics from language generation and translation to assess the lexical and semantic similarity of the generated topics. Our system aims to reduce both the ratio of outlier topics to the total number of topics and the similarity between topic definitions. The methodology has been assessed on a newly gathered corpus of scientific papers on literature-based discovery. Through rigorous evaluation by domain experts, AHAM has been validated as effective in uncovering intriguing and novel insights within broad research areas. We explore the impact of domain adaptation of sentence transformers for the task of topic modeling using two datasets, each specialized to a specific scientific domain within arXiv and medRxiv. We evaluate the impact of data size, the niche of adaptation, and the importance of domain adaptation. Our results suggest a strong interaction between domain adaptation and topic modeling precision in terms of outliers and topic definitions.
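A hedged sketch of the pipeline the abstract describes: fit BERTopic over sentence-transformer embeddings, track the outlier ratio, and ask an LLM for topic names via a one-shot prompt. The model identifiers, corpus file, prompt wording, and the token-overlap similarity proxy below are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch, NOT the AHAM reference implementation: model names, prompt
# wording, and the similarity proxy are assumptions for illustration.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from transformers import pipeline

docs = open("abstracts.txt").read().splitlines()  # hypothetical corpus file

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # swap in a domain-adapted model
topic_model = BERTopic(embedding_model=embedder)
topics, _ = topic_model.fit_transform(docs)

# AHAM tracks the ratio of outlier documents (BERTopic labels them -1).
outlier_ratio = sum(t == -1 for t in topics) / len(topics)

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")

def name_topic(words):
    """One-shot prompt: a single worked example, then the new keyword list."""
    prompt = ("Keywords: gene, expression, regulation -> Topic: gene expression\n"
              f"Keywords: {', '.join(words)} -> Topic:")
    out = generator(prompt, max_new_tokens=10, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()

names = [name_topic([w for w, _ in topic_model.get_topic(t)[:5]])
         for t in sorted(set(topics)) if t != -1]

# Inter-topic similarity via lexical overlap, standing in for the paper's
# generation/translation metrics; lower overlap means more distinct topics.
def overlap(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
mean_similarity = sum(overlap(a, b) for a, b in pairs) / max(len(pairs), 1)
print(outlier_ratio, mean_similarity)
```

Lower values of both quantities are what the adaptation loop aims for: fewer outlier documents and less redundancy among generated topic definitions.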
Related papers
- Automating Bibliometric Analysis with Sentence Transformers and Retrieval-Augmented Generation (RAG): A Pilot Study in Semantic and Contextual Search for Customized Literature Characterization for High-Impact Urban Research [2.1728621449144763]
Bibliometric analysis is essential for understanding research trends, scope, and impact in urban science.
Traditional methods, relying on keyword searches, often fail to uncover valuable insights not explicitly stated in article titles or keywords.
We leverage Generative AI models, specifically transformers and Retrieval-Augmented Generation (RAG), to automate and enhance bibliometric analysis.
arXiv Detail & Related papers (2024-10-08T05:13:27Z)
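The retrieval step such a pipeline relies on can be sketched with off-the-shelf sentence embeddings; a minimal example under assumed model and corpus choices, not the pilot study's actual setup:

```python
# Hedged sketch of embedding-based literature search; the top hits would feed
# the RAG prompt in a full pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
abstracts = [
    "Urban heat islands and mitigation strategies in dense cities.",
    "Topic modeling of scientific literature with neural embeddings.",
]
corpus_emb = model.encode(abstracts, convert_to_tensor=True)

query = "semantic search for urban climate research"
query_emb = model.encode(query, convert_to_tensor=True)

# Rank abstracts by cosine similarity to the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for h in hits:
    print(abstracts[h["corpus_id"]], h["score"])
```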
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
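The encoding-based route the summary mentions can be sketched as a pretrained encoder with a classification head over sentence pairs; the label set and model name below are hypothetical, and the paper's framework also covers generation-based approaches:

```python
# Hedged sketch: fine-tune an encoder with a classification head on
# (old_sentence, new_sentence) pairs; here the head is untrained, so the
# prediction is random until fine-tuning.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["grammar", "clarity", "fact-update", "style"]  # hypothetical intents
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(labels))

old = "The results was significant."
new = "The results were significant."
inputs = tok(old, new, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax())])
```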
- Interactive Topic Models with Optimal Transport [75.26555710661908]
We present EdTM, an approach for label-name-supervised topic modeling.
EdTM frames topic modeling as an assignment problem, leveraging LM/LLM-based document-topic affinities.
arXiv Detail & Related papers (2024-06-28T13:57:27Z)
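A toy version of "topic modeling as an assignment problem": given document-topic affinity scores, solve for an assignment that maximizes total affinity. EdTM itself uses optimal-transport solvers; this sketch substitutes scipy's Hungarian algorithm and random affinities purely for illustration:

```python
# Hedged sketch: balanced document-to-topic assignment over affinity scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
affinity = rng.random((4, 4))  # rows: documents, cols: label-name topics

# Maximize total affinity = minimize negative affinity.
doc_idx, topic_idx = linear_sum_assignment(-affinity)
for d, t in zip(doc_idx, topic_idx):
    print(f"doc {d} -> topic {t} (affinity {affinity[d, t]:.2f})")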
- Exploring the Power of Topic Modeling Techniques in Analyzing Customer Reviews: A Comparative Analysis [0.0]
Machine learning and natural language processing algorithms have been deployed to analyze the vast amount of textual data available online.
In this study, we examine and compare five frequently used topic modeling methods specifically applied to customer reviews.
Our findings reveal that BERTopic consistently yields more meaningful extracted topics and achieves favorable results.
arXiv Detail & Related papers (2023-08-19T08:18:04Z)
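The skeleton of such a comparison is fitting candidate models on the same review corpus and scoring each with a common coherence metric; a minimal sketch with toy reviews, comparing LDA settings rather than the study's five methods:

```python
# Hedged sketch: same corpus, shared coherence score, compare configurations.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

reviews = [
    "battery life is great but the screen scratches easily",
    "fast shipping and the battery lasts all day",
    "screen quality is poor and support was unhelpful",
    "customer support resolved my shipping issue quickly",
]
texts = [r.split() for r in reviews]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3):  # the study compares whole methods; here, only topic counts
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence="c_v")
    print(k, round(cm.get_coherence(), 3))
```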
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
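The word-intrusion task the summary refers to can be automated by flagging the word least similar to the rest of a topic's top words; a hedged sketch with generic sentence-transformer vectors, not the paper's exact scoring:

```python
# Hedged sketch of automated word intrusion: lowest average similarity wins.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
topic_words = ["neuron", "synapse", "cortex", "axon", "keyboard"]  # intruder: keyboard

emb = model.encode(topic_words)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = emb @ emb.T

# Average similarity of each word to the others; the lowest is the intruder.
avg = (sim.sum(axis=1) - 1.0) / (len(topic_words) - 1)
print(topic_words[int(avg.argmin())])
```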
- Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z)
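The paper's Bayesian deep model is more involved; as a hedged analogy only, the general idea of injecting prior domain knowledge can be shown by seeding LDA's topic-word prior (eta) toward known concept words:

```python
# Analogy sketch, NOT the paper's model: seed the topic-word prior with
# domain knowledge so known concept words anchor their topics.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["gene", "protein", "cell"], ["court", "law", "judge"],
         ["gene", "cell", "dna"], ["law", "trial", "judge"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

eta = np.full((2, len(dictionary)), 0.01)
for word in ("gene", "protein", "cell"):   # prior knowledge: topic 0 = biology
    eta[0, dictionary.token2id[word]] = 1.0
for word in ("court", "law", "judge"):     # prior knowledge: topic 1 = legal
    eta[1, dictionary.token2id[word]] = 1.0

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, eta=eta, random_state=0)
print(lda.show_topics())
```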
- Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review [52.359007622096684]
Peer review is a key component of the publishing process in most fields of science.
Existing NLP studies focus on the analysis of individual texts, whereas editorial assistance often requires modeling interactions between pairs of texts.
arXiv Detail & Related papers (2022-04-22T16:39:38Z)
- Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations [35.74225306947918]
We propose a joint latent space learning and clustering framework built upon PLM embeddings.
Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery.
arXiv Detail & Related papers (2022-02-09T17:26:08Z)
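The basic recipe behind this line of work (embed documents with a PLM, then cluster in the latent space) is easy to sketch; note the paper learns the latent space and the clusters jointly, which this simple two-stage pipeline does not:

```python
# Hedged two-stage sketch: PLM embeddings, then k-means topic discovery.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "quantum error correction codes",
    "topological qubits and braiding",
    "protein folding with deep learning",
    "alphafold structure prediction",
]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(emb)
for doc, lab in zip(docs, labels):
    print(lab, doc)
```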
- Domain-adaptation of spherical embeddings [0.0]
We develop methods to counter the global rotation of the embedding space and propose strategies to update words and documents during domain-specific training.
We show that our strategies are able to reduce the performance cost of domain adaptation to a level similar to Word2Vec.
arXiv Detail & Related papers (2021-11-01T03:29:36Z)
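Countering a global rotation between a general and a domain-adapted embedding space can be illustrated with orthogonal Procrustes alignment; the paper's method for spherical embeddings differs in detail, so this is a hedged stand-in on synthetic data:

```python
# Hedged illustration: recover the rotation that domain training introduced.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
general = rng.standard_normal((100, 8))

# Simulate domain training as a rotation of the space plus small noise.
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
adapted = general @ Q + 0.01 * rng.standard_normal((100, 8))

R, _ = orthogonal_procrustes(adapted, general)
aligned = adapted @ R
print(np.abs(aligned - general).mean())  # small residual after de-rotation
```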
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding framework for semantic parsing and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.