Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval
- URL: http://arxiv.org/abs/2510.02326v1
- Date: Thu, 25 Sep 2025 21:35:46 GMT
- Title: Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval
- Authors: Vivek Bhavsar, Joseph Ereifej, Aravanan Gurusami
- Abstract summary: RA-FSM is a GPT-based research assistant that wraps generation in a finite-state control loop: Relevance -> Confidence -> Knowledge. The controller filters out-of-scope queries, scores answerability, decomposes questions, and triggers retrieval only when needed. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models accelerate literature synthesis but can hallucinate and mis-cite, limiting their usefulness in expert workflows. We present RA-FSM (Research Assistant - Finite State Machine), a modular GPT-based research assistant that wraps generation in a finite-state control loop: Relevance -> Confidence -> Knowledge. The system is grounded in vector retrieval and a deterministic citation pipeline. The controller filters out-of-scope queries, scores answerability, decomposes questions, triggers retrieval only when needed, and emits answers with confidence labels and in-corpus, de-duplicated references. A ranked-tier ingestion workflow constructs a domain knowledge base from journals, conferences, indices, preprints, and patents, writing both to a dense vector index and to a relational store of normalized metrics. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design. In blinded A/B reviews, domain experts prefer RA-FSM to both a strong Notebook LM (NLM) baseline and a vanilla single-pass Default GPT API call, citing stronger boundary-condition handling and more defensible evidence use. Coverage and novelty analyses indicate that RA-FSM explores beyond the NLM while incurring tunable latency and cost overheads. The design emphasizes transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains.
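As a rough sketch (not the paper's implementation), the Relevance -> Confidence -> Knowledge control loop described in the abstract might look like the following. The function names (`is_in_scope`, `score_answerability`, `retrieve`, `generate`), the confidence threshold, and the single retrieval pass are all illustrative assumptions:

```python
from enum import Enum, auto

class State(Enum):
    RELEVANCE = auto()
    CONFIDENCE = auto()
    KNOWLEDGE = auto()
    ANSWER = auto()
    REJECT = auto()

def run_fsm(query, is_in_scope, score_answerability, retrieve, generate,
            threshold=0.7):
    """Drive one query through a Relevance -> Confidence -> Knowledge loop."""
    state, context = State.RELEVANCE, []
    while True:
        if state is State.RELEVANCE:
            # Relevance gate: filter out-of-scope queries before generating.
            state = State.CONFIDENCE if is_in_scope(query) else State.REJECT
        elif state is State.CONFIDENCE:
            # Score answerability; retrieve only when confidence is too low.
            if score_answerability(query, context) >= threshold:
                state = State.ANSWER
            else:
                state = State.KNOWLEDGE
        elif state is State.KNOWLEDGE:
            # Ground the answer in retrieved passages, then proceed to answer.
            context = context + retrieve(query)
            state = State.ANSWER
        elif state is State.ANSWER:
            return generate(query, context)
        else:  # State.REJECT: out-of-scope query, no answer emitted
            return None
```

In this sketch the Knowledge state transitions straight to Answer; a closer rendering of the paper's loop might instead return to Confidence for re-scoring after retrieval.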
Related papers
- Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System [4.222675210976564]
Polymer literature contains a large and growing body of experimental knowledge. Much of it is buried in unstructured text and inconsistent terminology. Existing tools typically extract narrow, study-specific facts in isolation.
arXiv Detail & Related papers (2026-02-18T17:46:09Z) - AnalyticsGPT: An LLM Workflow for Scientometric Question Answering [1.5658704610960574]
This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering.
arXiv Detail & Related papers (2026-02-10T14:23:55Z) - FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents [53.03492387564392]
We introduce FS-Researcher, a file-system-based framework that scales deep research beyond the context window via a persistent workspace. A Context Builder agent browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts.
arXiv Detail & Related papers (2026-02-02T03:00:19Z) - DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z) - MARVEL: A Multi Agent-based Research Validator and Enabler using Large Language Models [2.0725712989738994]
We present MARVEL, a framework for domain-aware question answering and assisted scientific research. MARVEL combines a fast path for straightforward queries with a more deliberate DeepSearch mode that integrates retrieval-augmented generation and Monte Carlo Tree Search. We applied this framework in the context of gravitational-wave research related to the Laser Interferometer Gravitational-wave Observatory.
arXiv Detail & Related papers (2026-01-06T21:47:22Z) - OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment [63.662126457336534]
OpenNovelty is an agentic system for transparent, evidence-based novelty analysis. It grounds all assessments in retrieved real papers, ensuring verifiable judgments. OpenNovelty aims to empower the research community with a scalable tool that promotes fair, consistent, and evidence-backed peer review.
arXiv Detail & Related papers (2026-01-04T15:48:51Z) - FeClustRE: Hierarchical Clustering and Semantic Tagging of App Features from User Reviews [0.0]
FeClustRE is a framework integrating hybrid feature extraction, hierarchical clustering with auto-tuning, and semantic labelling. We evaluate FeClustRE on public benchmarks for extraction correctness and on a sample study of generative AI assistant app reviews for clustering quality, semantic coherence, and interpretability.
arXiv Detail & Related papers (2025-10-21T16:54:21Z) - Exploratory Semantic Reliability Analysis of Wind Turbine Maintenance Logs using Large Language Models [0.0]
This paper addresses the gap in leveraging modern large language models (LLMs) for more complex reasoning tasks. We introduce an exploratory framework that uses LLMs to move beyond classification and perform semantic analysis. The results demonstrate that LLMs can function as powerful "reliability co-pilots," moving beyond labelling to synthesise textual information into actionable, expert-level hypotheses.
arXiv Detail & Related papers (2025-09-26T14:00:20Z) - Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate [0.19676943624884313]
Hallucinations in Large Language Model (LLM) outputs for Question Answering tasks critically undermine their real-world reliability. This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access. Our approach derives uncertainty indicators directly from readily available token-level log-probabilities generated during non-greedy decoding.
arXiv Detail & Related papers (2025-09-01T13:34:21Z) - Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework [55.078301794183496]
We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions.
arXiv Detail & Related papers (2025-08-29T08:48:00Z) - FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance. The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms. We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z) - Hallucination Detection in LLMs with Topological Divergence on Attention Graphs [60.83579255387347]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models. We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z) - Causal Retrieval with Semantic Consideration [6.967392207053045]
We propose CAWAI, a retrieval model that is trained with dual objectives: semantic and causal relations. Our experiments demonstrate that CAWAI outperforms various models on diverse causal retrieval tasks. We also show that CAWAI exhibits strong zero-shot generalization across scientific domain QA tasks.
arXiv Detail & Related papers (2025-04-07T03:04:31Z) - Improving Retrieval in Theme-specific Applications using a Corpus
Topical Taxonomy [52.426623750562335]
We introduce ToTER (Topical taxonomy Enhanced Retrieval) framework.
ToTER identifies the central topics of queries and documents with the guidance of the taxonomy, and exploits their topical relatedness to supplement missing contexts.
As a plug-and-play framework, ToTER can be flexibly employed to enhance various PLM-based retrievers.
arXiv Detail & Related papers (2024-03-07T02:34:54Z) - Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented Generation in Niche Domains, Exemplified by Korean Medicine [5.120567378386615]
We propose a natural language prompt-based retrieval augmented generation (Prompt-RAG) to enhance the performance of generative large language models (LLMs) in niche domains.
We compare vector embeddings from Korean Medicine (KM) and Conventional Medicine (CM) documents, finding that KM document embeddings correlated more with token overlaps and less with human-assessed document relatedness.
Results showed that Prompt-RAG outperformed existing models, including ChatGPT and conventional vector embedding-based RAGs, in terms of relevance and informativeness.
arXiv Detail & Related papers (2024-01-20T14:59:43Z) - Building Interpretable and Reliable Open Information Retriever for New Domains Overnight [67.03842581848299]
Information retrieval is a critical component for many downstream tasks such as open-domain question answering (QA).
We propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on different information units of the query.
We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks.
arXiv Detail & Related papers (2023-08-09T07:47:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.