Harnessing Large Language Models for Scientific Novelty Detection
- URL: http://arxiv.org/abs/2505.24615v1
- Date: Fri, 30 May 2025 14:08:13 GMT
- Title: Harnessing Large Language Models for Scientific Novelty Detection
- Authors: Yan Liu, Zonglin Yang, Soujanya Poria, Thanh-Son Nguyen, Erik Cambria
- Abstract summary: We propose to harness large language models (LLMs) for scientific novelty detection (ND). To capture idea conception, we propose to train a lightweight retriever by distilling idea-level knowledge from LLMs. Experiments show our method consistently outperforms others on the proposed benchmark datasets for idea retrieval and ND tasks.
- Score: 49.10608128661251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In an era of exponential scientific growth, identifying novel research ideas is crucial and challenging in academia. Despite its potential, the lack of an appropriate benchmark dataset hinders research on novelty detection. More importantly, simply adopting existing NLP techniques, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution due to the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), together with two new datasets in the marketing and NLP domains. To construct suitable datasets for ND, we extract closure sets of papers based on their relationships and then summarize their main ideas with LLMs. To capture idea conception, we train a lightweight retriever by distilling idea-level knowledge from LLMs to align ideas with similar conception, enabling efficient and accurate idea retrieval for LLM-based novelty detection. Experiments show our method consistently outperforms others on the proposed benchmark datasets for idea retrieval and ND tasks. Code and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.
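To make the pipeline concrete, here is a minimal sketch of a retrieve-then-judge novelty check in the spirit of the abstract; it is not the authors' implementation. The `embed` function is a toy stand-in for the distilled lightweight retriever, and the prompt wording is assumed.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding; a real system would use the distilled
    lightweight retriever the paper trains (this stand-in is only illustrative)."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % dim] += 1.0
    return vec

def top_k_ideas(query_idea, corpus_ideas, k=3):
    """Retrieve the k prior ideas whose embeddings are most similar to the query."""
    q = embed(query_idea)
    scored = []
    for idea in corpus_ideas:
        v = embed(idea)
        sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        scored.append((idea, sim))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

def novelty_prompt(query_idea, retrieved):
    """Build a prompt asking an LLM to judge novelty against the retrieved ideas."""
    context = "\n".join(f"- {idea} (similarity={sim:.2f})" for idea, sim in retrieved)
    return (
        "Prior ideas:\n" + context +
        f"\n\nCandidate idea:\n{query_idea}\n\n"
        "Is the candidate idea novel relative to the prior ideas? "
        "Answer yes or no, then justify briefly."
    )

# Usage: retrieve neighbors for a candidate idea, then send the prompt
# to whichever LLM client is available.
corpus = [
    "contrastive pretraining for citation recommendation",
    "retrieval-augmented generation of related-work sections",
]
candidate = "distill idea-level knowledge from an LLM into a lightweight retriever"
print(novelty_prompt(candidate, top_k_ideas(candidate, corpus, k=2)))
```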
Related papers
- Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science [25.857554476782827]
This paper explores how augmenting large language models with relevant data during the idea generation process can enhance the quality of generated ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality.
arXiv Detail & Related papers (2025-05-27T16:23:42Z)
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) user prompts are often under-specified; (ii) retrieved candidate papers frequently contain irrelevant content; and (iii) task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
- LANID: LLM-assisted New Intent Discovery [18.15557766598695]
New Intent Discovery (NID) is a crucial task that aims to identify novel intents while maintaining the capability to recognize existing ones. Previous efforts to adapt task-oriented dialogue systems (TODS) to new intents have struggled with inadequate semantic representation. We propose LANID, a framework that enhances the semantic representation of lightweight NID encoders with the guidance of Large Language Models.
arXiv Detail & Related papers (2025-03-31T05:34:32Z)
- Enhancing LLM Reasoning with Reward-guided Tree Search [95.06503095273395]
Developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. We present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms.
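A minimal sketch of reward-guided tree search over reasoning paths is given below; `propose_steps` and `reward` are hypothetical stand-ins for an LLM step generator and a reward model, not components described in the paper.

```python
import heapq

def propose_steps(state: str) -> list:
    """Hypothetical stand-in for an LLM proposing candidate next reasoning steps."""
    return [state + f" -> step{i}" for i in range(2)]

def reward(state: str) -> float:
    """Hypothetical stand-in for a reward model scoring a partial reasoning path."""
    return -float(len(state))  # toy scoring: prefer shorter paths

def is_terminal(state: str, max_depth: int = 3) -> bool:
    """A path is 'finished' once it contains max_depth reasoning steps."""
    return state.count("->") >= max_depth

def reward_guided_search(question: str, budget: int = 20) -> str:
    """Best-first search: always expand the highest-reward partial path."""
    state = question
    frontier = [(-reward(question), question)]
    while frontier and budget > 0:
        _, state = heapq.heappop(frontier)
        budget -= 1
        if is_terminal(state):
            return state
        for nxt in propose_steps(state):
            heapq.heappush(frontier, (-reward(nxt), nxt))
    return state  # fall back to the last expanded path if the budget runs out

print(reward_guided_search("Q: 17 * 24 = ?"))
```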
arXiv Detail & Related papers (2024-11-18T16:15:17Z)
- IdeaBench: Benchmarking Large Language Models for Research Idea Generation [19.66218274796796]
Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems.
We propose IdeaBench, a benchmark system that includes a comprehensive dataset and an evaluation framework.
Our dataset comprises titles and abstracts from a diverse range of influential papers, along with their referenced works.
Our evaluation framework is a two-stage process: first, using GPT-4o to rank ideas based on user-specified quality indicators such as novelty and feasibility, enabling scalable personalization.
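As a rough illustration of this first stage, one might prompt an LLM to order candidate ideas by the chosen indicators; the template and helper names below are assumptions, not the benchmark's actual code.

```python
def build_ranking_prompt(ideas, indicators=("novelty", "feasibility")):
    """Ask an LLM to rank candidate ideas by user-specified quality indicators.
    The wording is illustrative, not the benchmark's actual template."""
    numbered = "\n".join(f"{i + 1}. {idea}" for i, idea in enumerate(ideas))
    criteria = " and ".join(indicators)
    return (
        f"Rank the following research ideas from best to worst by {criteria}.\n"
        f"Reply with a comma-separated list of their numbers only.\n\n{numbered}"
    )

def parse_ranking(response: str, n: int) -> list:
    """Map a reply like '2, 1' back onto zero-based idea indices."""
    order = [int(tok) - 1 for tok in response.replace(" ", "").split(",") if tok]
    return [i for i in order if 0 <= i < n]

ideas = [
    "distill an LLM's idea-level knowledge into a lightweight retriever",
    "use citation closure sets as benchmark units for novelty detection",
]
prompt = build_ranking_prompt(ideas)
# Send `prompt` to GPT-4o (or any other LLM) with whichever client is available,
# then reorder the ideas according to the parsed reply:
print(parse_ranking("2, 1", len(ideas)))  # -> [1, 0]
```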
arXiv Detail & Related papers (2024-10-31T17:04:59Z)
- SciPIP: An LLM-based Scientific Paper Idea Proposer [30.670219064905677]
We introduce SciPIP, an innovative framework designed to enhance the proposal of scientific ideas through improvements in both literature retrieval and idea generation. Our experiments, conducted across various domains such as natural language processing and computer vision, demonstrate SciPIP's capability to generate a multitude of innovative and useful ideas.
arXiv Detail & Related papers (2024-10-30T16:18:22Z)
- Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents [64.64280477958283]
An exponential increase in scientific literature makes it challenging for researchers to stay current with recent advances and identify meaningful research directions.
Recent developments in large language models (LLMs) suggest a promising avenue for automating the generation of novel research ideas.
We propose a Chain-of-Ideas (CoI) agent, an LLM-based agent that organizes relevant literature in a chain structure to effectively mirror the progressive development in a research domain.
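A minimal sketch of the chain idea, assuming each paper carries a date and a one-line idea summary (the field names and prompt wording are illustrative, not the CoI agent's design):

```python
from datetime import date

def build_chain(papers):
    """Order related papers chronologically so the prompt mirrors how the
    research line developed (a simplification of the chain structure)."""
    return sorted(papers, key=lambda p: p["date"])

def chain_prompt(chain):
    """Turn the ordered chain into a prompt asking for the next idea."""
    steps = "\n".join(f"{i + 1}. ({p['date']}) {p['idea']}" for i, p in enumerate(chain))
    return (
        "The following ideas show how a research line evolved:\n"
        f"{steps}\n\n"
        "Propose the next research idea that naturally extends this chain."
    )

papers = [
    {"date": date(2023, 5, 1), "idea": "retrieval-augmented related-work generation"},
    {"date": date(2022, 1, 10), "idea": "dense retrieval for citation recommendation"},
]
print(chain_prompt(build_chain(papers)))  # pass the prompt to an LLM of choice
```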
arXiv Detail & Related papers (2024-10-17T03:26:37Z)
- Advancing Academic Knowledge Retrieval via LLM-enhanced Representation Similarity Fusion [7.195738513912784]
This paper introduces LLM-KnowSimFuser, proposed by Robo Space, which won 2nd place in the KDD Cup 2024 Challenge.
Drawing inspiration from the superior performance of LLMs on multiple tasks, we first perform fine-tuning and inference using LLM-enhanced pre-trained retrieval models.
Experiments conducted on the competition datasets show the superiority of our proposal, which achieved a score of 0.20726 on the final leaderboard.
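A rough sketch of fusing similarity scores from several embedding models is shown below; the weights and toy data are assumptions rather than the winning entry's configuration.

```python
import numpy as np

def fused_similarity(query_vecs, doc_vecs, weights):
    """Weighted fusion of cosine similarities from several embedding models.
    query_vecs/doc_vecs are lists of (n, d_i) matrices, one per model;
    the weights are assumed, not tuned values from the competition entry."""
    fused = 0.0
    for q, d, w in zip(query_vecs, doc_vecs, weights):
        q = q / np.linalg.norm(q, axis=1, keepdims=True)
        d = d / np.linalg.norm(d, axis=1, keepdims=True)
        fused = fused + w * (q @ d.T)
    return fused

# Toy usage: two models with different embedding sizes, three queries, four docs.
rng = np.random.default_rng(0)
q1, d1 = rng.standard_normal((3, 8)), rng.standard_normal((4, 8))
q2, d2 = rng.standard_normal((3, 16)), rng.standard_normal((4, 16))
scores = fused_similarity([q1, q2], [d1, d2], weights=[0.6, 0.4])
print(scores.shape)  # (3, 4): fused query-document similarity matrix
```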
arXiv Detail & Related papers (2024-10-14T12:49:13Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge for modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
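The nearest-neighbor interpolation that such models rely on can be sketched as follows; this is a generic kNN-LM-style mixture with illustrative constants, and the paper's speed-up techniques themselves are not shown.

```python
import numpy as np

def knn_lm_distribution(p_lm, neighbor_dists, neighbor_tokens, vocab_size,
                        lam=0.25, temperature=1.0):
    """Interpolate a parametric LM distribution with a nearest-neighbor
    distribution built from retrieved (distance, next-token) pairs; the
    interpolation weight and temperature here are illustrative."""
    logits = -np.asarray(neighbor_dists, dtype=float) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for tok, w in zip(neighbor_tokens, weights):
        p_knn[tok] += w
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy usage: vocab of 5 tokens, 3 retrieved neighbors from the datastore.
p_lm = np.full(5, 0.2)
mixed = knn_lm_distribution(p_lm, neighbor_dists=[0.1, 0.4, 0.9],
                            neighbor_tokens=[2, 2, 4], vocab_size=5)
print(mixed, mixed.sum())  # a valid distribution summing to 1
```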
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.