Related papers: PseudoSeer: a Search Engine for Pseudocode

PseudoSeer: a Search Engine for Pseudocode

URL: http://arxiv.org/abs/2411.12649v1
Date: Tue, 19 Nov 2024 16:58:03 GMT
Title: PseudoSeer: a Search Engine for Pseudocode
Authors: Levent Toksoz, Mukund Srinath, Gang Tan, C. Lee Giles,
Abstract summary: A novel pseudocode search engine is designed to facilitate efficient retrieval and search of academic papers containing pseudocode. By leveraging snippets, the system enables users to search across various facets of a paper, such as the title, abstract, author information, and code snippets. A weighted BM25-based ranking algorithm is used by the search engine, and factors considered when prioritizing search results are described.
Score: 18.726136894285403
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: A novel pseudocode search engine is designed to facilitate efficient retrieval and search of academic papers containing pseudocode. By leveraging Elasticsearch, the system enables users to search across various facets of a paper, such as the title, abstract, author information, and LaTeX code snippets, while supporting advanced features like combined facet searches and exact-match queries for more targeted results. A description of the data acquisition process is provided, with arXiv as the primary data source, along with methods for data extraction and text-based indexing, highlighting how different data elements are stored and optimized for search. A weighted BM25-based ranking algorithm is used by the search engine, and factors considered when prioritizing search results for both single and combined facet searches are described. We explain how each facet is weighted in a combined search. Several search engine results pages are displayed. Finally, there is a brief overview of future work and potential evaluation methodology for assessing the effectiveness and performance of the search engine is described.

Related papers

Forward Index Compression for Learned Sparse Retrieval [15.629655228398567]
We focus on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index.<n>In particular, we seek compression techniques to reduce the storage footprint of the forward index without compromising search quality or inner product computation latency.
arXiv Detail & Related papers (2026-02-05T08:35:17Z)
Chain of Retrieval: Multi-Aspect Iterative Search Expansion and Post-Order Search Aggregation for Full Paper Retrieval [68.71038700559195]
Chain of Retrieval(COR) is a novel iterative framework for full-paper retrieval.<n>We present SCIBENCH, a benchmark providing both complete and segmented contexts of full papers for queries and candidates.
arXiv Detail & Related papers (2025-07-14T08:41:53Z)
Dense Passage Retrieval in Conversational Search [0.0]
We present a new method called dense retrieval, which uses a dual-encoder to create contextual embeddings that can be indexed and clustered efficiently at run-time. We propose an end-to-end conversational search system called GPT2QR+DPR, which incorporates various query reformulation strategies to improve retrieval accuracy. Our work contributes to the growing body of research on neural-based retrieval methods in conversational search, and highlights the potential of dense retrieval in improving retrieval accuracy in conversational search systems.
arXiv Detail & Related papers (2025-03-21T19:39:31Z)
Ranking Narrative Query Graphs for Biomedical Document Retrieval (Technical Report) [7.527096697768715]
This paper extends our existing graph-based discovery system for the biomedical domain. It contributes effective graph-based unsupervised ranking methods, a new query relaxation paradigm, and ontological rewriting.
arXiv Detail & Related papers (2024-12-06T12:49:28Z)
VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and Optimized Search [1.0411820336052784]
We propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval. By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy. Experiments on real-world datasets show that VectorSearch outperforms baseline metrics.
arXiv Detail & Related papers (2024-09-25T21:58:08Z)
Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling. We generate supplemental training pairs by altering the most important part of a search context. We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
arXiv Detail & Related papers (2024-07-04T08:08:33Z)
Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner. Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval. We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance to a query of a document.
arXiv Detail & Related papers (2024-03-31T13:29:43Z)
LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
List K-kNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance. There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results. We develop a novel pseudolabel generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z)
How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales. We scale generative retrieval to millions of passages with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
Automated Query Generation for Evidence Collection from Web Search Engines [2.642698101441705]
It is widely accepted that so-called facts can be checked by searching for information on the Internet. This process requires a fact-checker to formulate a search query based on the fact and to present it to a search engine. We ask the question as to whether it is possible to automate the first step, that of query generation.
arXiv Detail & Related papers (2023-03-15T14:32:00Z)
Unified Functional Hashing in Automatic Machine Learning [58.77232199682271]
We show that large efficiency gains can be obtained by employing a fast unified functional hash. Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently. We show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery.
arXiv Detail & Related papers (2023-02-10T18:50:37Z)
Deep Reinforcement Agent for Efficient Instant Search [14.086339486783018]
We propose to address the load issue by identifying tokens that are semantically more salient towards retrieving relevant documents. We train a reinforcement agent that interacts directly with the search engine and learns to predict the word's importance. A novel evaluation framework is presented to study the trade-off between the number of triggered searches and the system's performance.
arXiv Detail & Related papers (2022-03-17T22:47:15Z)
Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems. We derive an evaluation metric to measure the quality of a ranking of exposing queries, as well as conducting an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.