Annotating Training Data for Conditional Semantic Textual Similarity Measurement using Large Language Models
- URL: http://arxiv.org/abs/2509.14399v1
- Date: Wed, 17 Sep 2025 20:01:54 GMT
- Title: Annotating Training Data for Conditional Semantic Textual Similarity Measurement using Large Language Models
- Authors: Gaifan Zhang, Yi Zhou, Danushka Bollegala
- Abstract summary: Deshpande et al. (2023) proposed the Conditional Semantic Textual Similarity (C-STS) task. We re-annotate a large training dataset for the C-STS task with minimal manual effort. By training a supervised C-STS model on our cleaned and re-annotated dataset, we achieve a 5.4% statistically significant improvement in Spearman correlation.
- Score: 24.298406471983558
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic similarity between two sentences depends on the aspects considered between those sentences. To study this phenomenon, Deshpande et al. (2023) proposed the Conditional Semantic Textual Similarity (C-STS) task and annotated a human-rated similarity dataset containing pairs of sentences compared under two different conditions. However, Tu et al. (2024) found various annotation issues in this dataset and showed that manually re-annotating a small portion of it leads to more accurate C-STS models. Despite these pioneering efforts, the lack of large and accurately annotated C-STS datasets remains a blocker for making progress on this task as evidenced by the subpar performance of the C-STS models. To address this training data need, we resort to Large Language Models (LLMs) to correct the condition statements and similarity ratings in the original dataset proposed by Deshpande et al. (2023). Our proposed method is able to re-annotate a large training dataset for the C-STS task with minimal manual effort. Importantly, by training a supervised C-STS model on our cleaned and re-annotated dataset, we achieve a 5.4% statistically significant improvement in Spearman correlation. The re-annotated dataset is available at https://LivNLP.github.io/CSTS-reannotation.
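To make the task structure and the evaluation metric concrete, here is a minimal Python sketch of a C-STS instance and a Spearman-correlation evaluation loop. The `CSTSInstance` fields, the toy example, and the `model.predict` interface are illustrative assumptions; only the task structure (two sentences, a condition, a rating) and the metric come from the abstract above.

```python
from dataclasses import dataclass
from scipy.stats import spearmanr

@dataclass
class CSTSInstance:
    sentence1: str
    sentence2: str
    condition: str   # the aspect under which similarity is judged
    rating: float    # gold similarity rating, e.g. on a 1-5 scale

# Toy instance in the style of C-STS: the same sentence pair can be
# similar under one condition and dissimilar under another.
example = CSTSInstance(
    sentence1="A man is playing a guitar on stage.",
    sentence2="A woman is playing a violin in a park.",
    condition="The type of instrument being played.",
    rating=1.0,  # dissimilar under this condition
)

def evaluate(model, instances):
    """Spearman correlation between predicted and gold ratings, the
    metric in which the abstract reports a 5.4% improvement.
    `model.predict` is a hypothetical interface, not the authors' code."""
    preds = [model.predict(x.sentence1, x.sentence2, x.condition)
             for x in instances]
    gold = [x.rating for x in instances]
    rho, p_value = spearmanr(preds, gold)
    return rho, p_value
```

Spearman correlation is rank-based, so it is insensitive to monotone rescaling of a model's predicted scores, which is why it is the conventional evaluation metric for STS-style tasks.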
Related papers
- SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training [70.84726713548099]
SAE-based Transferability Score (STS) is a new metric to forecast post-training transferability. We show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes.
arXiv Detail & Related papers (2026-03-03T12:01:09Z) - ZZU-NLP at SIGHAN-2024 dimABSA Task: Aspect-Based Sentiment Analysis with Coarse-to-Fine In-context Learning [0.36332383102551763]
The DimABSA task requires fine-grained sentiment intensity prediction for restaurant reviews.
We propose a Coarse-to-Fine In-context Learning method based on the Baichuan2-7B model for the DimABSA task.
arXiv Detail & Related papers (2024-07-22T02:54:46Z) - Advancing Semantic Textual Similarity Modeling: A Regression Framework with Translated ReLU and Smooth K2 Loss [3.435381469869212]
This paper presents an innovative regression framework for Sentence-BERT STS tasks.
It proposes two simple yet effective loss functions: Translated ReLU and Smooth K2 Loss; a hedged sketch of one plausible reading of these losses appears after this list.
Experimental results demonstrate that our method achieves convincing performance across seven established STS benchmarks.
arXiv Detail & Related papers (2024-06-08T02:52:43Z) - Linguistically Conditioned Semantic Textual Similarity [6.049872961766425]
We re-annotate the C-STS validation set and observe annotator discrepancies on 55% of the instances, resulting from annotation errors in the original labels.
We present an automatic error identification pipeline that can identify annotation errors in the C-STS data with over 80% F1 score.
We propose a new method that largely improves the performance over baselines on the C-STS data by training the models with the answers.
arXiv Detail & Related papers (2024-06-06T01:23:45Z) - Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z) - Data Similarity is Not Enough to Explain Language Model Performance [6.364065652816667]
Similarity measures are often assumed to correlate with language model performance.
However, similarity metrics turn out not to be correlated with accuracy, or even with each other.
This suggests that the relationship between pretraining data and downstream tasks is more complex than often assumed.
arXiv Detail & Related papers (2023-11-15T14:48:08Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context; a simplified sketch of this computation appears after this list.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - C-STS: Conditional Semantic Textual Similarity [70.09137422955506]
We propose a novel task called Conditional STS (C-STS).
It measures the similarity of two sentences conditioned on a feature described in natural language (hereafter, the condition).
C-STS's advantages are two-fold: it reduces the subjectivity and ambiguity of STS and enables fine-grained language model evaluation through diverse natural language conditions.
arXiv Detail & Related papers (2023-05-24T12:18:50Z) - On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study which specific traits of the pre-training data, other than semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets such as QAMR and SQuAD 2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
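For "Advancing Semantic Textual Similarity Modeling" above, the named loss functions suggest margin-based regression objectives. The PyTorch sketch below, referenced from that entry, is one plausible reading: a translated (shifted) ReLU over the absolute error, plus a smooth squared counterpart. Both formulas are assumptions made for illustration; the paper's exact definitions may differ.

```python
import torch

def translated_relu_loss(pred: torch.Tensor, target: torch.Tensor,
                         b: float = 0.25) -> torch.Tensor:
    # max(0, |error| - b): absolute errors within the margin b are not
    # penalized, matching the shifted-ReLU shape the name suggests
    # (an assumption, not the paper's stated formula).
    return torch.clamp(torch.abs(pred - target) - b, min=0.0).mean()

def smooth_k2_loss(pred: torch.Tensor, target: torch.Tensor,
                   b: float = 0.25) -> torch.Tensor:
    # A smooth, squared penalty beyond the same margin; again one
    # plausible reading of "Smooth K2 Loss", not the paper's definition.
    err = torch.clamp(torch.abs(pred - target) - b, min=0.0)
    return (err ** 2).mean()
```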
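For "Evaluation of Faithfulness Using the Longest Supported Subsequence" above, the sketch referenced from that entry follows. Under the simplifying assumption that "supported" means exact token matching, the LSS reduces to a longest-common-subsequence computation between claim and context; the paper itself instead fine-tunes a model to generate the LSS.

```python
def longest_supported_subsequence(claim: str, context: str) -> list[str]:
    """Longest (possibly noncontiguous) subsequence of claim tokens that
    also appears, in order, in the context: standard LCS dynamic
    programming over whitespace tokens (a simplification of the paper's
    learned notion of support)."""
    a, b = claim.split(), context.split()
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover the subsequence itself.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

lss = longest_supported_subsequence(
    "the cat sat on the mat",
    "yesterday the black cat quietly sat on a soft mat")
# lss == ['the', 'cat', 'sat', 'on', 'mat']; a simple faithfulness
# score is then len(lss) divided by the number of claim tokens.
```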