Related papers: Iterative NLP Query Refinement for Enhancing Domain-Specific Information Retrieval: A Case Study in Career Services

Iterative NLP Query Refinement for Enhancing Domain-Specific Information Retrieval: A Case Study in Career Services

URL: http://arxiv.org/abs/2412.17075v1
Date: Sun, 22 Dec 2024 15:57:35 GMT
Title: Iterative NLP Query Refinement for Enhancing Domain-Specific Information Retrieval: A Case Study in Career Services
Authors: Elham Peimani, Gurpreet Singh, Nisarg Mahyavanshi, Aman Arora, Awais Shaikh,
Abstract summary: Retrieving semantically relevant documents in niche domains poses significant challenges for TF-IDF-based systems.<n>This paper introduces an iterative and semi-automated query refinement methodology tailored to Humber College's career services webpages.
Score: 0.13980986259786224
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieving semantically relevant documents in niche domains poses significant challenges for traditional TF-IDF-based systems, often resulting in low similarity scores and suboptimal retrieval performance. This paper addresses these challenges by introducing an iterative and semi-automated query refinement methodology tailored to Humber College's career services webpages. Initially, generic queries related to interview preparation yield low top-document similarities (approximately 0.2--0.3). To enhance retrieval effectiveness, we implement a two-fold approach: first, domain-aware query refinement by incorporating specialized terms such as resources-online-learning, student-online-services, and career-advising; second, the integration of structured educational descriptors like "online resume and interview improvement tools." Additionally, we automate the extraction of domain-specific keywords from top-ranked documents to suggest relevant terms for query expansion. Through experiments conducted on five baseline queries, our semi-automated iterative refinement process elevates the average top similarity score from approximately 0.18 to 0.42, marking a substantial improvement in retrieval performance. The implementation details, including reproducible code and experimental setups, are made available in our GitHub repositories \url{https://github.com/Elipei88/HumberChatbotBackend} and \url{https://github.com/Nisarg851/HumberChatbot}. We also discuss the limitations of our approach and propose future directions, including the integration of advanced neural retrieval models.

Related papers

Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG)<n>RAG requires source-aware, multi-hop reasoning over diverse, sparsed, but related sources.<n>We build it using a synthetic data pipeline that simulates business across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles [51.0691253204425]
We introduce a retrieval approach leveraging Support Vector Regression ensembles, bootstrap aggregation (bagging), and embedding spaces on the German dataset for Legal Information Retrieval (GerDaLIR) We show improved recall over the baselines using our voting ensemble, suggesting promising initial results, without training or fine-tuning any deep learning models.
arXiv Detail & Related papers (2025-01-09T07:21:44Z)
Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language [4.5224851085910585]
Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages.<n>This study addresses the challenge of automated collecting test datasets to evaluate semantic search in low-resource domain-specific German language.
arXiv Detail & Related papers (2024-12-13T09:47:26Z)
Quam: Adaptive Retrieval through Query Affinity Modelling [15.3583908068962]
Building relevance models to rank documents based on user information needs is a central task in information retrieval and the NLP community. We propose a unifying view of the nascent area of adaptive retrieval by proposing, Quam. Our proposed approach, Quam improves the recall performance by up to 26% over the standard re-ranking baselines.
arXiv Detail & Related papers (2024-10-26T22:52:12Z)
An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from difference sources. We propose a data processing framework that integrates a Processing Module and an Analyzing Module. The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
Learning to Retrieve for Job Matching [22.007634436648427]
We discuss applying learning-to-retrieve technology to enhance LinkedIns job search and recommendation systems. We leverage confirmed hire data to construct a graph that evaluates a seeker's qualification for a job, and utilize learned links for retrieval. In addition to a solution based on a conventional inverted index, we developed an on-GPU solution capable of supporting both KNN and term matching efficiently.
arXiv Detail & Related papers (2024-02-21T00:05:25Z)
Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
generative retrieval systems often directly return a grounded generated text as a response to a query. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z)
How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales. We scale generative retrieval to millions of passages with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z)
Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections. InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant. To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
Mining Implicit Relevance Feedback from User Behavior for Web Question Answering [92.45607094299181]
We make the first study to explore the correlation between user behavior and passage relevance. Our approach significantly improves the accuracy of passage ranking without extra human labeled data. In practice, this work has proved effective to substantially reduce the human labeling cost for the QA service in a global commercial search engine.
arXiv Detail & Related papers (2020-06-13T07:02:08Z)
Leveraging Cognitive Search Patterns to Enhance Automated Natural Language Retrieval Performance [0.0]
We show that cognitive reformulation patterns that mimic user search behaviour are highlighted. We formalize the application of these patterns by considering a query conceptual representation. A genetic algorithm-based weighting process allows placing emphasis on terms according to their conceptual role-type.
arXiv Detail & Related papers (2020-04-21T14:13:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.