Related papers: Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

URL: http://arxiv.org/abs/2506.08592v2
Date: Tue, 26 Aug 2025 03:31:26 GMT
Title: Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings
Authors: Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou
Abstract summary: This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics. We introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. We finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model.
Score: 65.31723739561151
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with such fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.

Related papers

Exploiting Inherent Class Label: Towards Robust Scribble Supervised Semantic Segmentation [15.439883888976464]
We propose a class-driven scribble promotion network for robust scribble-supervised semantic segmentation. Within the network, we introduce a localization rectification module to mitigate noisy labels and a distance perception module to identify reliable regions surrounding scribble annotations and pseudo-labels. Our method demonstrates competitive performance in both accuracy and robustness, underscoring its superiority over existing approaches.
arXiv Detail & Related papers (2025-03-18T04:43:07Z)
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric. Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
PRIME: Prioritizing Interpretability in Failure Mode Extraction [49.93565079216376]
We study the challenge of providing human-understandable descriptions for failure modes in trained image classification models. We propose a novel approach that prioritizes interpretability in this problem. Our method successfully identifies failure modes and generates high-quality text descriptions associated with them.
arXiv Detail & Related papers (2023-09-29T22:00:12Z)
Conjunct Resolution in the Face of Verbal Omissions [51.220650412095665]
We propose a conjunct resolution task that operates directly on the text and makes use of a split-and-rephrase paradigm in order to recover the missing elements in the coordination structure. We curate a large dataset, containing over 10K examples of naturally-occurring verbal omissions with crowd-sourced annotations. We train various neural baselines for this task, and show that while our best method obtains decent performance, it leaves ample space for improvement.
arXiv Detail & Related papers (2023-05-26T08:44:02Z)
Collaborative Auto-encoding for Blind Image Quality Assessment [17.081262827258943]
Blind image quality assessment (BIQA) is a challenging problem with important real-world applications. Recent efforts attempting to exploit powerful representations by deep neural networks (DNN) are hindered by the lack of subjectively annotated data. This paper presents a novel BIQA method which overcomes this fundamental obstacle.
arXiv Detail & Related papers (2023-05-24T03:45:03Z)
Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization [76.57699934689468]
We propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side to enhance the performance of neural models. To overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens.
arXiv Detail & Related papers (2023-05-18T16:02:04Z)
Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration. CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root. We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding that CAPOT has an impact similar to data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z)
What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting projections contain rich semantic information, and draw a connection between them and sparse retrieval.
arXiv Detail & Related papers (2022-12-20T16:03:25Z)
ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms. We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance. Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning [25.88974494276895]
This work demonstrates how off-the-shelf, large-scale, image-to-text and text-to-image models can be leveraged to automatically find failures. In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic, yet realistic, inputs.
arXiv Detail & Related papers (2022-08-18T13:49:10Z)
A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation [50.55448707570669]
We propose a novel token-level, reference-free hallucination detection task and an associated annotated dataset named HaDes. To create this dataset, we first perturb a large number of text segments extracted from English language Wikipedia, and then verify these with crowd-sourced annotations.
arXiv Detail & Related papers (2021-04-18T04:09:48Z)
Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems. We generate document representations that capture both text and metadata artifacts in a task-agnostic manner. Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.