LMdiff: A Visual Diff Tool to Compare Language Models
- URL: http://arxiv.org/abs/2111.01582v1
- Date: Tue, 2 Nov 2021 13:17:20 GMT
- Title: LMdiff: A Visual Diff Tool to Compare Language Models
- Authors: Hendrik Strobelt, Benjamin Hoover, Arvind Satyanarayan, Sebastian Gehrmann
- Abstract summary: LMdiff is a tool that visually compares probability distributions of two models that differ.
We showcase the applicability of LMdiff for hypothesis generation across multiple case studies.
- Score: 25.229215469012637
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While different language models are ubiquitous in NLP, it is hard to contrast
their outputs and identify which contexts one can handle better than the other.
To address this question, we introduce LMdiff, a tool that visually compares
probability distributions of two models that differ, e.g., through finetuning,
distillation, or simply training with different parameter sizes. LMdiff allows
the generation of hypotheses about model behavior by investigating text
instances token by token and further assists in choosing these interesting text
instances by identifying the most interesting phrases from large corpora. We
showcase the applicability of LMdiff for hypothesis generation across multiple
case studies. A demo is available at http://lmdiff.net .
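The token-by-token comparison at the heart of LMdiff can be illustrated with a small sketch. Toy next-token distributions stand in for real model softmax outputs here; no language models are loaded, and the function and variable names are illustrative, not LMdiff's actual API:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over aligned probability dicts (toy stand-ins for
    the next-token distributions of two models at one position)."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def diff_tokens(tokens, probs_a, probs_b):
    """For each token position, report the probability each model assigns
    to the observed token and the divergence between the two distributions."""
    report = []
    for i, tok in enumerate(tokens):
        pa, pb = probs_a[i], probs_b[i]
        report.append({
            "token": tok,
            "p_model_a": pa[tok],
            "p_model_b": pb[tok],
            "kl_a_b": kl_divergence(pa, pb),
        })
    return report

# Toy example: the two "models" agree on the first token but
# disagree sharply on the second.
tokens = ["the", "cat"]
probs_a = [{"the": 0.9, "a": 0.1}, {"cat": 0.7, "dog": 0.3}]
probs_b = [{"the": 0.8, "a": 0.2}, {"cat": 0.2, "dog": 0.8}]
report = diff_tokens(tokens, probs_a, probs_b)
```

Positions with a large divergence are exactly the "interesting" spots such a tool would surface when scanning a corpus for instances worth inspecting.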
Related papers
- RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns [50.401907401444404]
Detecting text generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. We propose RepreGuard, an efficient statistics-based detection method. Experimental results show that RepreGuard outperforms all baselines, with an average AUROC of 94.92% in both in-distribution (ID) and out-of-distribution (OOD) scenarios.
arXiv Detail & Related papers (2025-08-18T17:59:15Z) - BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models [55.2480439325792]
We propose a methodology for automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences in the ease of generation between two LMs. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.
arXiv Detail & Related papers (2025-06-02T19:44:06Z) - You've Changed: Detecting Modification of Black-Box Large Language Models [4.7541096609711]
Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior.
We present an approach to monitor LLMs for changes by comparing the distributions of linguistic and psycholinguistic features of generated text.
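A minimal version of this monitoring idea can be sketched as follows, assuming only two toy features (mean word length and word count) rather than the paper's full set of linguistic and psycholinguistic features; all names are illustrative:

```python
import statistics

def text_features(text):
    """Toy feature extractor: hypothetical stand-ins for richer
    linguistic and psycholinguistic features of generated text."""
    words = text.split()
    return {
        "mean_word_len": sum(len(w) for w in words) / len(words),
        "word_count": len(words),
    }

def feature_shift(samples_old, samples_new, feature):
    """Absolute difference in a feature's mean between two batches of
    API outputs; a large shift suggests the black-box model changed."""
    old = [text_features(s)[feature] for s in samples_old]
    new = [text_features(s)[feature] for s in samples_new]
    return abs(statistics.mean(new) - statistics.mean(old))

# Outputs collected before and after a suspected silent model update.
old = ["the cat sat", "a dog ran"]
new = ["extraordinarily verbose generation", "unnecessarily elaborate phrasing"]
shift = feature_shift(old, new, "mean_word_len")
```

In practice one would use many samples per batch and a proper two-sample test per feature, but the principle of comparing feature distributions across time is the same.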
arXiv Detail & Related papers (2025-04-14T04:16:43Z) - I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models [50.34197948438868]
ThinkDiff is an alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities.
We show that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation.
We also demonstrate exceptional performance in composing multiple images and texts into logically coherent images.
arXiv Detail & Related papers (2025-02-12T05:30:08Z) - Model-diff: A Tool for Comparative Study of Language Models in the Input Space [34.680890752084004]
We propose a new model comparative analysis setting that considers a large input space where brute-force enumeration would be infeasible.
Experiments reveal for the first time the quantitative prediction differences between LMs in a large input space, potentially facilitating the model analysis for applications such as model plagiarism.
arXiv Detail & Related papers (2024-12-13T00:06:25Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation [6.273933281069326]
We investigate three text diversity incentive methods well established in crowdsourcing: taboo words, hints by previous outlier solutions, and chaining on previous outlier solutions.
We show that diversity is most increased by taboo words, but downstream model performance is highest with hints.
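The taboo-word incentive, the simplest of the three, can be sketched as a check that rejects a new sample reusing salient words from earlier outputs. The function and data below are illustrative, not the paper's implementation:

```python
def taboo_violations(text, taboo_words):
    """Return the taboo words (e.g. salient words from earlier
    generations) that a new sample reuses; a nonempty result
    signals that the sample should be regenerated."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return sorted(words & set(taboo_words))

taboo = {"happy", "great"}
violations = taboo_violations("What a great, happy day!", taboo)  # ['great', 'happy']
```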
arXiv Detail & Related papers (2024-01-12T15:46:43Z) - diff History for Neural Language Agents [33.13471417703669]
We introduce diff history, a simple and highly effective solution to these issues.
By applying the Unix diff command to consecutive text observations in the interaction histories used to prompt LM policies, we abstract away redundant information.
On NetHack, an unsolved video game that requires long-horizon reasoning for decision-making, LMs tuned with diff history match state-of-the-art performance for neural agents.
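The core transformation, replacing each full observation with a Unix-style diff against the previous one, can be sketched with Python's standard difflib; the NetHack-like observation text below is made up for illustration:

```python
import difflib

def diff_observation(prev, curr):
    """Replace a full text observation with a zero-context unified diff
    against the previous one, so unchanged lines never reach the prompt."""
    lines = difflib.unified_diff(
        prev.splitlines(), curr.splitlines(), lineterm="", n=0
    )
    return "\n".join(lines)

prev = "HP: 14\nGold: 3\nYou see a closed door."
curr = "HP: 12\nGold: 3\nYou see a closed door."
compact = diff_observation(prev, curr)
# only the changed HP line survives; the unchanged lines are dropped
```

The diff is much shorter than the raw observation, which lets far longer interaction histories fit inside a fixed LM context window.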
arXiv Detail & Related papers (2023-12-12T18:59:30Z) - Describing Differences in Image Sets with Natural Language [101.80939666230168]
Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets.
We introduce VisDiff, which first captions the images and prompts a language model to propose difference descriptions.
We are able to find interesting and previously unknown differences in datasets and models, demonstrating VisDiff's utility in revealing nuanced insights.
arXiv Detail & Related papers (2023-12-05T18:59:16Z) - Perturbed examples reveal invariances shared by language models [8.04604449335578]
We introduce a novel framework to compare two NLP models.
Via experiments on models from the same and different architecture families, this framework offers insights about how changes in models affect linguistic capabilities.
arXiv Detail & Related papers (2023-11-07T17:48:35Z) - MacLaSa: Multi-Aspect Controllable Text Generation via Efficient Sampling from Compact Latent Space [110.85888003111653]
Multi-aspect controllable text generation aims to generate fluent sentences that possess multiple desired attributes simultaneously.
We introduce a novel approach for multi-aspect control, namely MacLaSa, that estimates compact latent space for multiple aspects.
We show that MacLaSa outperforms several strong baselines on attribute relevance and textual quality while maintaining a high inference speed.
arXiv Detail & Related papers (2023-05-22T07:30:35Z) - Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models [9.808214545408541]
LinguisticLens is a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of datasets.
It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples.
arXiv Detail & Related papers (2023-05-19T00:53:45Z) - Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Multivariate Data Explanation by Jumping Emerging Patterns Visualization [78.6363825307044]
We present VAX (multiVariate dAta eXplanation), a new VA method to support the identification and visual interpretation of patterns in multivariate data sets.
Unlike the existing similar approaches, VAX uses the concept of Jumping Emerging Patterns to identify and aggregate several diversified patterns, producing explanations through logic combinations of data variables.
arXiv Detail & Related papers (2021-06-21T13:49:44Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - SentenceMIM: A Latent Variable Language Model [19.39122632876056]
SentenceMIM is a probabilistic auto-encoder for language data.
It is trained with Mutual Information Machine (MIM) learning to provide a fixed length representation of variable length language observations.
We demonstrate the versatility of sentenceMIM by utilizing a trained model for question-answering and transfer learning.
arXiv Detail & Related papers (2020-02-18T15:34:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.