ClusterTalk: Corpus Exploration Framework using Multi-Dimensional Exploratory Search
- URL: http://arxiv.org/abs/2412.14533v1
- Date: Thu, 19 Dec 2024 05:11:16 GMT
- Title: ClusterTalk: Corpus Exploration Framework using Multi-Dimensional Exploratory Search
- Authors: Ashish Chouhan, Saifeldin Mandour, Michael Gertz
- Abstract summary: ClusterTalk is a framework for corpus exploration using multi-dimensional exploratory search. Our system integrates document clustering with faceted search, allowing users to interactively refine their exploration and ask corpus and document-level queries.
- Score: 3.4123736336071864
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Exploratory search of large text corpora is essential in domains like biomedical research, where large amounts of research literature are continuously generated. This paper presents ClusterTalk, a framework for corpus exploration using multi-dimensional exploratory search (the demo video and source code are available at https://github.com/achouhan93/ClusterTalk). Our system integrates document clustering with faceted search, allowing users to interactively refine their exploration and ask corpus-level and document-level queries. Compared to traditional one-dimensional search approaches like keyword search or clustering, this system improves the discoverability of information by encouraging a deeper interaction with the corpus. We demonstrate the functionality of the ClusterTalk framework on four million PubMed abstracts covering a four-year time frame.
Related papers
- Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval [22.33550491040999]
RAG grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We build two plug-and-play retrievers: StatementGraphRAG and TopicGraphRAG. Our methods outperform naive chunk-based RAG, achieving an average relative improvement of 23.1% in retrieval recall and correctness.
arXiv Detail & Related papers (2025-06-09T17:58:35Z)
- ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework [73.91207117772291]
ManuSearch is a transparent and modular multi-agent framework designed to democratize deep search for large language models (LLMs). ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content.
arXiv Detail & Related papers (2025-05-23T17:02:02Z)
- Ranking Narrative Query Graphs for Biomedical Document Retrieval (Technical Report) [7.527096697768715]
This paper extends our existing graph-based discovery system for the biomedical domain. It contributes effective graph-based unsupervised ranking methods, a new query relaxation paradigm, and ontological rewriting.
arXiv Detail & Related papers (2024-12-06T12:49:28Z)
- PseudoSeer: a Search Engine for Pseudocode [18.726136894285403]
A novel pseudocode search engine is designed to facilitate efficient retrieval and search of academic papers containing pseudocode.
By leveraging snippets, the system enables users to search across various facets of a paper, such as the title, abstract, author information, and code snippets.
A weighted BM25-based ranking algorithm is used by the search engine, and factors considered when prioritizing search results are described.
arXiv Detail & Related papers (2024-11-19T16:58:03Z)
- Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
arXiv Detail & Related papers (2024-10-17T17:03:23Z)
- Organizing Unstructured Image Collections using Natural Language [37.16101036513514]
We introduce the task of Open-ended Semantic Multiple Clustering, which aims to automatically discover clustering criteria from large, unstructured image collections.
Our framework, X-Cluster, uses text as a proxy to concurrently reason over large image collections, discover clustering criteria, and reveal semantic substructures.
We apply X-Cluster to various real-world applications, such as discovering biases and analyzing social media image popularity.
arXiv Detail & Related papers (2024-10-07T17:21:46Z)
- ELCC: the Emergent Language Corpus Collection [1.6574413179773761]
The Emergent Language Corpus Collection (ELCC) is a collection of corpora generated from open source implementations of emergent communication systems. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus.
arXiv Detail & Related papers (2024-07-04T21:23:18Z)
- Enhancing Text Corpus Exploration with Post Hoc Explanations and Comparative Design [6.8863648800930655]
Text corpus exploration (TCE) spans the range of exploratory search tasks.
Current systems lack the flexibility to support the range of tasks encountered in practice.
We provide methods that enhance TCE tools with post hoc explanations and multiscale, comparative designs.
arXiv Detail & Related papers (2024-06-14T03:13:58Z)
- Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner.
Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval.
We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance of a document to a query.
arXiv Detail & Related papers (2024-03-31T13:29:43Z)
- DiscoverPath: A Knowledge Refinement and Retrieval System for Interdisciplinarity on Biomedical Research [96.10765714077208]
Traditional keyword-based search engines fall short in assisting users who may not be familiar with specific terminologies.
We present a knowledge graph-based paper search engine for biomedical research to enhance the user experience.
The system, dubbed DiscoverPath, employs Named Entity Recognition (NER) and part-of-speech (POS) tagging to extract terminologies and relationships from article abstracts to create a KG.
arXiv Detail & Related papers (2023-09-04T20:52:33Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- MICO: Selective Search with Mutual Information Co-training [14.456028769565386]
MICO is a Mutual Information CO-training framework for selective search.
After training, MICO not only clusters the documents but also routes unseen queries to the relevant clusters for efficient retrieval.
arXiv Detail & Related papers (2022-09-09T16:26:52Z)
- Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries, and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)
- Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms.
Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time.
Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)
- A New Neural Search and Insights Platform for Navigating and Organizing AI Research [56.65232007953311]
We introduce a new platform, AI Research Navigator, that combines classical keyword search with neural retrieval to discover and organize relevant literature.
We give an overview of the overall architecture of the system and of the components for document analysis, question answering, search, analytics, expert search, and recommendations.
arXiv Detail & Related papers (2020-10-30T19:12:25Z)
- Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval [117.07047313964773]
We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions.
Our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers.
Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time.
arXiv Detail & Related papers (2020-09-27T06:12:29Z)
- A Feature Analysis for Multimodal News Retrieval [9.269820020286382]
We consider five feature types for image and text and compare the performance of the retrieval system using different combinations.
Experimental results show that retrieval results can be improved when considering both visual and textual information.
arXiv Detail & Related papers (2020-07-13T14:09:29Z)
- Interactive Extractive Search over Biomedical Corpora [41.72755714431404]
We present a system that allows life-science researchers to search a linguistically annotated corpus of texts.
We introduce a light-weight query language that does not require the user to know the details of the underlying linguistic representations.
Search is performed at interactive speed thanks to an efficient linguistic graph-indexing and retrieval engine.
arXiv Detail & Related papers (2020-06-07T13:26:32Z)
- Search Result Clustering in Collaborative Sound Collections [17.48516881308658]
We propose a graph-based approach using audio features for clustering diverse sound collections obtained when querying large online databases.
We show that using a confidence measure for discarding inconsistent clusters improves the quality of the partitions.
arXiv Detail & Related papers (2020-04-08T13:08:17Z)