Evaluating the Robustness of Dense Retrievers in Interdisciplinary Domains
- URL: http://arxiv.org/abs/2506.21581v1
- Date: Mon, 16 Jun 2025 23:54:08 GMT
- Title: Evaluating the Robustness of Dense Retrievers in Interdisciplinary Domains
- Authors: Sarthak Chaturvedi, Anurag Acharya, Rounak Meyur, Koby Hayashi, Sai Munikoti, Sameera Horawalavithana,
- Abstract summary: Evaluation benchmark characteristics may distort the true benefits of domain adaptation in retrieval models.<n>We show two benchmarks with drastically different features such as topic diversity, boundary overlap, and semantic complexity can influence the perceived benefits of fine-tuning.
- Score: 0.6432265982168868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation benchmark characteristics may distort the true benefits of domain adaptation in retrieval models. This creates misleading assessments that influence deployment decisions in specialized domains. We show that two benchmarks with drastically different features such as topic diversity, boundary overlap, and semantic complexity can influence the perceived benefits of fine-tuning. Using environmental regulatory document retrieval as a case study, we fine-tune ColBERTv2 model on Environmental Impact Statements (EIS) from federal agencies. We evaluate these models across two benchmarks with different semantic structures. Our findings reveal that identical domain adaptation approaches show very different perceived benefits depending on evaluation methodology. On one benchmark, with clearly separated topic boundaries, domain adaptation shows small improvements (maximum 0.61% NDCG gain). However, on the other benchmark with overlapping semantic structures, the same models demonstrate large improvements (up to 2.22% NDCG gain), a 3.6-fold difference in the performance benefit. We compare these benchmarks through topic diversity metrics, finding that the higher-performing benchmark shows 11% higher average cosine distances between contexts and 23% lower silhouette scores, directly contributing to the observed performance difference. These results demonstrate that benchmark selection strongly determines assessments of retrieval system effectiveness in specialized domains. Evaluation frameworks with well-separated topics regularly underestimate domain adaptation benefits, while those with overlapping semantic boundaries reveal improvements that better reflect real-world regulatory document complexity. Our findings have important implications for developing and deploying AI systems for interdisciplinary domains that integrate multiple topics.
Related papers
- A New HOPE: Domain-agnostic Automatic Evaluation of Text Chunking [44.47350338664039]
Document chunking fundamentally impacts Retrieval-Augmented Generation (RAG)<n>There is currently no framework to analyze the impact of different chunking methods.<n>We introduce a novel methodology that defines essential characteristics of the chunking process at three levels.
arXiv Detail & Related papers (2025-05-04T16:22:27Z) - Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.<n>We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection.<n>Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains.<n>We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z) - Exploiting Aggregation and Segregation of Representations for Domain Adaptive Human Pose Estimation [50.31351006532924]
Human pose estimation (HPE) has received increasing attention recently due to its wide application in motion analysis, virtual reality, healthcare, etc.<n>It suffers from the lack of labeled diverse real-world datasets due to the time- and labor-intensive annotation.<n>We introduce a novel framework that capitalizes on both representation aggregation and segregation for domain adaptive human pose estimation.
arXiv Detail & Related papers (2024-12-29T17:59:45Z) - Domain penalisation for improved Out-of-Distribution Generalisation [1.979158763744267]
Domain generalisation (DG) aims to ensure robust performance across diverse and unseen target domains.
We propose a framework for the task of object detection, where the data is assumed to be sampled from multiple source domains.
By prioritising the domains that needs more attention, our approach effectively balances the training process.
arXiv Detail & Related papers (2024-08-03T11:06:47Z) - ATTA: Anomaly-aware Test-Time Adaptation for Out-of-Distribution
Detection in Segmentation [22.084967085509387]
We propose a dual-level OOD detection framework to handle domain shift and semantic shift jointly.
The first level distinguishes whether domain shift exists in the image by leveraging global low-level features.
The second level identifies pixels with semantic shift by utilizing dense high-level feature maps.
arXiv Detail & Related papers (2023-09-12T06:49:56Z) - Summarization from Leaderboards to Practice: Choosing A Representation
Backbone and Ensuring Robustness [21.567112955050582]
In both automatic and human evaluation, BART performs better than PEG and T5.
We find considerable variation in system output that can be captured only with human evaluation.
arXiv Detail & Related papers (2023-06-18T13:35:41Z) - Cross-Domain Policy Adaptation via Value-Guided Data Filtering [57.62692881606099]
Generalizing policies across different domains with dynamics mismatch poses a significant challenge in reinforcement learning.
We present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets.
arXiv Detail & Related papers (2023-05-28T04:08:40Z) - Multi-level Consistency Learning for Semi-supervised Domain Adaptation [85.90600060675632]
Semi-supervised domain adaptation (SSDA) aims to apply knowledge learned from a fully labeled source domain to a scarcely labeled target domain.
We propose a Multi-level Consistency Learning framework for SSDA.
arXiv Detail & Related papers (2022-05-09T06:41:18Z) - Knowledge Distillation for BERT Unsupervised Domain Adaptation [2.969705152497174]
A pre-trained language model, BERT, has brought significant performance improvements across a range of natural language processing tasks.
We propose a simple but effective unsupervised domain adaptation method, adversarial adaptation with distillation (AAD)
We evaluate our approach in the task of cross-domain sentiment classification on 30 domain pairs.
arXiv Detail & Related papers (2020-10-22T06:51:24Z) - Adaptively-Accumulated Knowledge Transfer for Partial Domain Adaptation [66.74638960925854]
Partial domain adaptation (PDA) deals with a realistic and challenging problem when the source domain label space substitutes the target domain.
We propose an Adaptively-Accumulated Knowledge Transfer framework (A$2$KT) to align the relevant categories across two domains.
arXiv Detail & Related papers (2020-08-27T00:53:43Z) - Cross-Domain Facial Expression Recognition: A Unified Evaluation
Benchmark and Adversarial Graph Learning [85.6386289476598]
We develop a novel adversarial graph representation adaptation (AGRA) framework for cross-domain holistic-local feature co-adaptation.
We conduct extensive and fair evaluations on several popular benchmarks and show that the proposed AGRA framework outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2020-08-03T15:00:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.