Related papers: Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

URL: http://arxiv.org/abs/2305.11364v2
Date: Wed, 27 Sep 2023 22:08:13 GMT
Title: Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models
Authors: Emily Reif, Minsuk Kahng, Savvas Petridis
Abstract summary: LinguisticLens is a novel interactive visualization tool for making sense of and analyzing syntactic diversity of datasets. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples.
Score: 9.808214545408541
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at shorturl.at/zHOUV.

Related papers

Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding [6.0158981171030685]
This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations. We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations. The results indicate that LLMs' performance is more influenced by features like lexical overlap and sentence length than by metaphorical content.
arXiv Detail & Related papers (2025-07-21T08:09:11Z)
Augmenting a Large Language Model with a Combination of Text and Visual Data for Conversational Visualization of Global Geospatial Data [51.57559025799189]
We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data. We address this problem by merging a text description of a visualization and dataset with snapshots of the visualization.
arXiv Detail & Related papers (2025-01-16T13:16:37Z)
Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates. We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, finding categories where one language model is better than the other.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation [2.1184929769291294]
This paper presents a novel synthetic dataset designed to evaluate the proficiency of large language models in interpreting data visualizations. Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios. We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models.
arXiv Detail & Related papers (2024-09-04T11:19:17Z)
Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a framework for learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on the modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z)
SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations [13.608653575298183]
We introduce the SUGARCREPE++ dataset to analyze the sensitivity of vision-and-language models to semantic alterations. We show that models which achieve better performance on compositionality datasets do not necessarily perform equally well on SUGARCREPE++.
arXiv Detail & Related papers (2024-06-17T03:22:20Z)
VLSlice: Interactive Vision-and-Language Slice Discovery [17.8634551024147]
VLSlice is an interactive system enabling user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior. We show that VLSlice enables users to quickly generate diverse high-coherency slices in a user study and release the tool publicly.
arXiv Detail & Related papers (2023-09-13T04:02:38Z)
Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions. This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions. Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z)
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets of other languages. We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar. We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods. Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.