FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction
- URL: http://arxiv.org/abs/2510.20926v1
- Date: Thu, 23 Oct 2025 18:30:19 GMT
- Title: FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction
- Authors: Natasha Johnson, Amanda Bertsch, Maria-Emil Deal, Emma Strubell
- Abstract summary: We release FICSIM, a dataset of long-form, recently written fiction, including scores along 12 axes of similarity informed by author-produced metadata. We evaluate a suite of embedding models on this task, demonstrating a tendency across models to focus on surface-level features over semantic categories.
- Score: 11.216252240451183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As language models become capable of processing increasingly long and complex texts, there has been growing interest in their application within computational literary studies. However, evaluating the usefulness of these models for such tasks remains challenging due to the cost of fine-grained annotation for long-form texts and the data contamination concerns inherent in using public-domain literature. Current embedding similarity datasets are not suitable for evaluating literary-domain tasks because they focus on coarse-grained similarity and consist primarily of very short texts. We assemble and release FICSIM, a dataset of long-form, recently written fiction, including scores along 12 axes of similarity informed by author-produced metadata and validated by digital humanities scholars. We evaluate a suite of embedding models on this task, demonstrating a tendency across models to focus on surface-level features over semantic categories that would be useful for computational literary studies tasks. Throughout our data-collection process, we prioritize author agency and rely on continual, informed author consent.
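The evaluation protocol the abstract describes can be pictured with a short, hedged sketch: embed each pair of stories, score the pair by cosine similarity, and correlate the model's scores with the human ratings on each similarity axis. The model name, data layout, and `evaluate` helper below are illustrative assumptions, not FicSim's actual interface.

```python
# A minimal sketch of embedding-similarity evaluation against multi-axis human
# scores. Model choice and the `pairs` structure are assumptions for illustration.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# pairs: list of (text_a, text_b, {axis_name: human_score, ...})
def evaluate(pairs, axes):
    model_scores = []
    human_scores = {axis: [] for axis in axes}
    for text_a, text_b, ratings in pairs:
        emb_a, emb_b = model.encode([text_a, text_b])
        model_scores.append(cosine(emb_a, emb_b))
        for axis in axes:
            human_scores[axis].append(ratings[axis])
    # One correlation per axis: a model attuned to surface-level features will
    # track style-like axes more closely than the semantic axes the paper probes.
    return {axis: spearmanr(model_scores, human_scores[axis])[0] for axis in axes}
```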
Related papers
- Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles [81.89404347890662]
SciTrek is a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
arXiv Detail & Related papers (2025-09-25T11:36:09Z)
- Empirical Evaluation of Embedding Models in the Context of Text Classification in Document Review in Construction Delay Disputes [6.076874513889027]
Text embeddings are numerical representations of text data, where words, phrases, or entire documents are converted into vectors of real numbers. This paper presents our work in evaluating different embeddings through a comprehensive comparative analysis of four distinct models. We employ both K-Nearest Neighbors (KNN) and Logistic Regression (LR) to perform binary classification tasks, specifically determining whether a text snippet is associated with 'delay' or 'not delay' within a labeled dataset.
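The setup described here is straightforward to sketch with scikit-learn: embed each snippet, then fit KNN and logistic regression on the vectors. The embedding model and the toy snippets below are placeholder assumptions, not the paper's data.

```python
# A hedged sketch of embedding-based binary classification with KNN and LR.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy labeled snippets; 1 = 'delay', 0 = 'not delay'.
snippets = [
    "Work stopped for three weeks awaiting permit approval.",
    "The subcontractor delivered materials on schedule.",
    "Excavation was suspended due to unforeseen site conditions.",
    "Weekly progress meeting minutes were circulated.",
]
labels = np.array([1, 0, 1, 0])

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
X = model.encode(snippets)                       # one vector per snippet

knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
logreg = LogisticRegression(max_iter=1000).fit(X, labels)

query = model.encode(["Rain delayed concrete pours for two weeks."])
print("KNN:", knn.predict(query), "LogReg:", logreg.predict(query))
```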
arXiv Detail & Related papers (2025-01-16T22:12:11Z)
- NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
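The construction this benchmark builds on is easy to sketch: plant a key fact (the "needle") at a chosen relative depth inside long distractor text, then test whether the model can retrieve it. The filler text, needle, and depth grid below are illustrative assumptions, not the benchmark's actual data.

```python
# A minimal needle-in-a-haystack probe builder, assuming synthetic filler text.
def build_probe(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at relative position `depth` in [0, 1] of `haystack`."""
    cut = int(len(haystack) * depth)
    # Back up to the nearest sentence boundary so the needle isn't mid-sentence.
    boundary = haystack.rfind(". ", 0, cut) + 1 if cut > 0 else 0
    return haystack[:boundary] + " " + needle + " " + haystack[boundary:]

filler = "The committee reviewed routine matters. " * 2000  # long distractor text
needle = "The vault code is 4831."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_probe(filler, needle, depth) + "\nQuestion: What is the vault code?"
    # Here one would send `prompt` to the model under test and check for "4831".
    print(f"depth={depth}: built prompt of {len(prompt)} chars")
```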
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
- Visual Analytics for Fine-grained Text Classification Models and Datasets [3.6873612681664016]
SemLa is a novel visual analytics system tailored for fine-grained text classification.
This paper details the iterative design study and the resulting innovations featured in SemLa.
arXiv Detail & Related papers (2024-03-21T17:26:28Z)
- U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts [9.76730765089929]
U-DIADS-Bib is a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in the fields of computer vision and humanities.
We also propose a novel computer-aided segmentation pipeline to alleviate the burden of the time-consuming manual annotation process.
arXiv Detail & Related papers (2024-01-16T15:11:18Z)
- A Novel Multidimensional Reference Model For Heterogeneous Textual Datasets Using Context, Semantic And Syntactic Clues [4.453735522794044]
This study aims to produce a novel multidimensional reference model using categories for heterogeneous datasets.
The main contribution of MRM is that it checks each token against each term based on an index of linguistic categories such as synonym, antonym, formal, lexical word order, and co-occurrence.
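A loose sketch of that token-term checking: compare each token against a term through an index keyed by linguistic category. The toy index below is a stand-in for the linguistic resources the paper actually draws on.

```python
# Illustrative category index: category -> term -> related words (toy data).
index = {
    "synonym": {"delay": {"postponement", "holdup"}},
    "antonym": {"delay": {"acceleration", "speedup"}},
    "co-occurrence": {"delay": {"schedule", "deadline"}},
}

def match_categories(token: str, term: str) -> list[str]:
    """Return every linguistic category under which `token` relates to `term`."""
    return [cat for cat, terms in index.items() if token in terms.get(term, set())]

print(match_categories("holdup", "delay"))    # ['synonym']
print(match_categories("deadline", "delay"))  # ['co-occurrence']
```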
arXiv Detail & Related papers (2023-11-10T17:02:25Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
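The quantity behind the metric can be illustrated with a small dynamic-programming stand-in: the longest (possibly noncontiguous) subsequence of claim tokens that also appears, in order, in the context, normalized by claim length. The paper fine-tunes a model to generate the LSS; this exact-token-match sketch only illustrates the underlying idea.

```python
# A rough LSS-style score via classic longest-common-subsequence DP over tokens.
def lss_score(claim: str, context: str) -> float:
    a, b = claim.split(), context.split()
    # dp[i][j] = LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tok_a == tok_b
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)] / max(len(a), 1)

print(lss_score("the bridge opened in 1932",
                "records show the bridge finally opened to traffic in 1932"))
# -> 1.0: every claim token is supported, in order, by the context
```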
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- PART: Pre-trained Authorship Representation Transformer [52.623051272843426]
Authors writing documents imprint identifying information within their texts. Previous works use hand-crafted features or classification tasks to train their authorship models. We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
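The contrastive objective such a model might use can be sketched as an InfoNCE-style loss that pulls same-author text embeddings together and pushes different authors apart. The encoder and batch construction here are placeholder assumptions, not PART's actual architecture.

```python
# A hedged sketch of contrastive authorship training.
import torch
import torch.nn.functional as F

def contrastive_authorship_loss(anchor: torch.Tensor, positive: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: row i of `anchor` and row i of `positive` share an author."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature     # scaled cosine similarities
    targets = torch.arange(a.size(0))  # the matching row is the positive pair
    return F.cross_entropy(logits, targets)

# Toy usage: a batch of 8 author-paired embeddings of dimension 128.
anchor, positive = torch.randn(8, 128), torch.randn(8, 128)
print(contrastive_authorship_loss(anchor, positive).item())
```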
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short texts.
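For reference, here is a minimal gensim sketch of the vanilla LDA baseline this model extends; plain LDA treats words within a document as exchangeable, which is precisely the assumption the proposed model relaxes, and it does not cluster users. The toy corpus is an assumption.

```python
# Standard LDA baseline (not the paper's extended model) on a toy corpus.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["game", "team", "score", "coach"],
    ["election", "vote", "policy", "senate"],
    ["team", "coach", "season", "score"],
    ["policy", "vote", "debate", "senate"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [word for word, _ in words])
```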
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)