Enhancing Semantic Document Retrieval- Employing Group Steiner Tree Algorithm with Domain Knowledge Enrichment
- URL: http://arxiv.org/abs/2508.20543v1
- Date: Thu, 28 Aug 2025 08:29:55 GMT
- Title: Enhancing Semantic Document Retrieval- Employing Group Steiner Tree Algorithm with Domain Knowledge Enrichment
- Authors: Apurva Kulkarni, Chandrashekar Ramanathan, Vinu E Venugopal
- Abstract summary: This research focuses on the development of a versatile algorithm, 'Semantic-based Concept Retrieval using Group Steiner Tree'. The proposed algorithm incorporates domain information to enhance semantic-aware knowledge representation and data access. To assess the effectiveness of the SemDR system, the work evaluates performance on a benchmark consisting of 170 real-world search queries.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieving pertinent documents from various data sources with diverse characteristics poses a significant challenge for Document Retrieval Systems. The complexity of this challenge is further compounded when accounting for the semantic relationship between data and domain knowledge. While existing retrieval systems using semantics (usually represented as Knowledge Graphs created from open-access resources and generic domain knowledge) hold promise in delivering relevant outcomes, their precision may be compromised due to the absence of domain-specific information and reliance on outdated knowledge sources. In this research, the primary focus is on two key contributions- a) the development of a versatile algorithm- 'Semantic-based Concept Retrieval using Group Steiner Tree' that incorporates domain information to enhance semantic-aware knowledge representation and data access, and b) the practical implementation of the proposed algorithm within a document retrieval system using real-world data. To assess the effectiveness of the SemDR system, research work conducts performance evaluations using a benchmark consisting of 170 real-world search queries. Rigorous evaluation and verification by domain experts are conducted to ensure the validity and accuracy of the results. The experimental findings demonstrate substantial advancements when compared to the baseline systems, with precision and accuracy achieving levels of 90% and 82% respectively, signifying promising improvements.
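The abstract names the Group Steiner Tree problem but does not describe the algorithm itself. As a rough illustration only, the sketch below shows one simple root-based heuristic for that problem: each query term maps to a group of candidate concept nodes, and the goal is a cheap tree touching at least one node from every group. The toy graph, the node names, the weights, and the function names (`add_edge`, `dijkstra`, `group_steiner_heuristic`) are all hypothetical and are not taken from the paper; the heuristic is a generic stand-in, not the SemDR method.

```python
import heapq


def add_edge(graph, u, v, weight):
    """Add an undirected weighted edge to an adjacency-dict graph."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight


def dijkstra(graph, source):
    """Return (dist, parent) maps for shortest paths from source."""
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def group_steiner_heuristic(graph, groups):
    """Heuristic (not optimal in general): try every node as a root,
    connect it to the nearest member of each group along shortest
    paths, and keep the cheapest resulting tree."""
    best_cost, best_edges = float("inf"), None
    for root in graph:
        dist, parent = dijkstra(graph, root)
        edges = set()
        feasible = True
        for group in groups:
            reachable = [n for n in group if n in dist]
            if not reachable:
                feasible = False
                break
            node = min(reachable, key=lambda n: dist[n])
            # Walk parent pointers back to the root, collecting edges.
            while parent[node] is not None:
                edges.add(tuple(sorted((node, parent[node]))))
                node = parent[node]
        if not feasible:
            continue
        cost = sum(graph[u][v] for u, v in edges)
        if cost < best_cost:
            best_cost, best_edges = cost, edges
    return best_cost, best_edges


# Hypothetical concept graph; edge weights stand in for semantic distance.
graph = {}
add_edge(graph, "crop", "agriculture", 1)
add_edge(graph, "soil", "agriculture", 1)
add_edge(graph, "climate", "agriculture", 2)
add_edge(graph, "rainfall", "climate", 1)
add_edge(graph, "monsoon", "climate", 1)
add_edge(graph, "crop", "yield", 1)

# One group of candidate concepts per (hypothetical) query term.
groups = [["crop", "yield"], ["soil"], ["rainfall", "monsoon"]]

cost, edges = group_steiner_heuristic(graph, groups)
print(cost, sorted(edges))
```

Because all paths are taken from a single shortest-path tree, the collected edges always form a tree. The heuristic runs one Dijkstra per node, which is fine for a toy graph but would need pruning or a better approximation scheme on a large knowledge graph.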
Related papers
- Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms [1.7842332554022695]
This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking. It implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach.
arXiv Detail & Related papers (2025-12-05T18:59:18Z) - A Systematic Framework for Enterprise Knowledge Retrieval: Leveraging LLM-Generated Metadata to Enhance RAG Systems [0.0]
This research presents a systematic framework for metadata enrichment using large language models (LLMs) to enhance document retrieval in Retrieval-Augmented Generation (RAG) systems. Our approach employs a comprehensive, structured pipeline that dynamically generates meaningful metadata for document segments.
arXiv Detail & Related papers (2025-12-05T04:05:06Z) - Understanding DeepResearch via Reports [41.60038455664918]
DeepResearch is a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. Evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities. We introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports.
arXiv Detail & Related papers (2025-10-09T07:03:43Z) - WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents [72.28593628378991]
WebResearcher is an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process. WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
arXiv Detail & Related papers (2025-09-16T17:57:17Z) - Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG [51.120170062795566]
We propose Divide-Then-Align (DTA) to endow RAG systems with the ability to respond with "I don't know" when the query is out of the knowledge boundary. DTA balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
arXiv Detail & Related papers (2025-05-27T08:21:21Z) - Semantic Synergy: Unlocking Policy Insights and Learning Pathways Through Advanced Skill Mapping [0.0]
This research introduces a comprehensive system based on state-of-the-art natural language processing, semantic embedding, and efficient search techniques. The system automatically extracts and aggregates normalized competencies from multiple documents. It creates strong relationships between recognized competencies, occupation profiles, and related learning courses.
arXiv Detail & Related papers (2025-03-13T06:41:26Z) - Enhancing Data Integrity through Provenance Tracking in Semantic Web Frameworks [1.3597551064547502]
SURROUND Australia Pty Ltd demonstrates innovative applications of the PROV Data Model (PROV-DM) and its Semantic Web variant, PROV-O. The paper highlights the company's architecture for capturing comprehensive provenance data, enabling robust validation, traceability, and knowledge inference.
arXiv Detail & Related papers (2025-01-12T16:13:27Z) - Exploring Information Retrieval Landscapes: An Investigation of a Novel Evaluation Techniques and Comparative Document Splitting Methods [0.0]
In this study, the structured nature of textbooks, the conciseness of articles, and the narrative complexity of novels are shown to require distinct retrieval strategies.
A novel evaluation technique is introduced, utilizing an open-source model to generate a comprehensive dataset of question-and-answer pairs.
The evaluation employs weighted scoring metrics, including SequenceMatcher, BLEU, METEOR, and BERT Score, to assess the system's accuracy and relevance.
arXiv Detail & Related papers (2024-09-13T02:08:47Z) - CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [52.61625841028781]
COIR (Code Information Retrieval Benchmark) is a robust and comprehensive benchmark designed to assess code retrieval capabilities. COIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We evaluate nine widely used retrieval models using COIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Gait Recognition in the Wild: A Large-scale Benchmark and NAS-based Baseline [95.88825497452716]
Gait benchmarks empower the research community to train and evaluate high-performance gait recognition systems.
GREW is the first large-scale dataset for gait recognition in the wild.
SPOSGait is the first NAS-based gait recognition model.
arXiv Detail & Related papers (2022-05-05T14:57:39Z) - Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z) - Improving Named Entity Recognition with Attentive Ensemble of Syntactic Information [36.03316058182617]
Named entity recognition (NER) is highly sensitive to sentential syntactic and semantic properties.
In this paper, we improve NER by leveraging different types of syntactic information through attentive ensemble.
Experimental results on six English and Chinese benchmark datasets suggest the effectiveness of the proposed model.
arXiv Detail & Related papers (2020-10-29T10:25:17Z) - Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark [57.10850350508929]
We aim to provide a unified framework to summarize and evaluate existing research on heterogeneous network embedding (HNE).
As the first contribution, we provide a generic paradigm for the systematic categorization and analysis over the merits of various existing HNE algorithms.
As the second contribution, we create four benchmark datasets from different sources, with various properties regarding scale, structure, attribute/label availability, etc.
As the third contribution, we create friendly interfaces for 13 popular HNE algorithms, and provide all-around comparisons among them over multiple tasks and experimental settings.
arXiv Detail & Related papers (2020-04-01T03:42:11Z)