PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases
- URL: http://arxiv.org/abs/2511.11141v1
- Date: Fri, 14 Nov 2025 10:19:04 GMT
- Title: PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases
- Authors: Udo Schlegel, Franziska Weeber, Jian Lan, Thomas Seidl
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries.
- Score: 9.398106516502477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP's stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.
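The abstract does not give the formula for PRSM, so the following is only a minimal sketch of the general idea of ranking stability under paraphrasing, not the authors' metric. It assumes each query has already been scored against the same fixed set of candidate images (for example, by CLIP text-image similarity), and it uses Spearman rank correlation as a stand-in stability measure. The function names `ranks`, `spearman`, and `paraphrase_stability` are hypothetical.

```python
from math import sqrt


def ranks(scores):
    """Return 1-based ranks (highest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next item has the same score (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score lists over the same candidates."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # Assumption: treat a constant ranking as uninformative.
    return cov / sqrt(var_a * var_b)


def paraphrase_stability(original_scores, paraphrase_scores):
    """Mean rank correlation between the original query and each paraphrase.

    Illustrative stand-in only; this is NOT the PRSM definition from the paper.
    """
    correlations = [spearman(original_scores, p) for p in paraphrase_scores]
    return sum(correlations) / len(correlations)


# Made-up similarity scores of one query and two paraphrases against three images.
original = [0.9, 0.5, 0.1]
paraphrases = [[0.8, 0.4, 0.2], [0.1, 0.5, 0.9]]
stability = paraphrase_stability(original, paraphrases)
```

Under this stand-in, a value near 1 means the paraphrases rank the images the same way as the original query, and lower values indicate that the ranking shifts with the wording.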
Related papers
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark [27.134554623769898]
The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). We identified critical benchmark-quality issues that hinder fair and consistent quantitative evaluations.
arXiv Detail & Related papers (2025-07-17T17:33:11Z)
- Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective [24.54292750583169]
Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. We propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities.
arXiv Detail & Related papers (2025-06-23T18:31:22Z)
- Large Language Models Often Say One Thing and Do Another [49.22262396351797]
We develop a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains.
arXiv Detail & Related papers (2025-03-10T07:34:54Z)
- Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
- Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Factors [7.312170216336085]
We extend the SocialIQA dataset to create diverse paraphrased sets conditioned on sociodemographic factors. We find that demographic-based paraphrasing significantly impacts the performance of language models.
arXiv Detail & Related papers (2025-01-14T17:50:06Z)
- RAcQUEt: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs [42.38066214464341]
We introduce RACQUET, a dataset targeting distinct aspects of ambiguity in image-based question answering. We reveal significant limitations and problems of overconfidence of state-of-the-art large multimodal language models in addressing ambiguity in their responses. Our results underscore the urgency of equipping models with robust strategies to deal with uncertainty without resorting to undesirable stereotypes.
arXiv Detail & Related papers (2024-12-18T13:25:11Z)
- Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts. We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z)
- Cross-modality debiasing: using language to mitigate sub-population shifts in imaging [28.88097536026781]
Sub-population shift accounts for a significant source of algorithmic bias and calls for distributional robustness.
Recent studies found inherent distributional robustness in multi-modality foundation models, such as the vision-language model CLIP.
We propose leveraging natural language inputs to debias the image feature representations, to improve worst-case performance on sub-populations.
arXiv Detail & Related papers (2024-02-02T18:54:48Z)
- Improving Language Models Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary [65.268245109828]
Non-human-like behaviour of contemporary pre-trained language models (PLMs) is a leading cause undermining their trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produces contradictory results.
We propose a practical approach that alleviates the inconsistent behaviour issue by improving PLM awareness.
arXiv Detail & Related papers (2023-10-24T06:15:15Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)