Related papers: Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation

Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models [6.036586911740041]
Large language models (LLMs) are increasingly used in verbal creative tasks. The widely used Divergent Association Task (DAT) focuses on novelty, ignoring appropriateness. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores on the task are lower than those of two baselines that do not possess any creative abilities.
arXiv Detail & Related papers (2026-01-28T12:41:32Z)
Evaluation Framework for AI Creativity: A Case Study Based on Story Generation [5.536493649574258]
Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via "Spike Prompting" and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments.
arXiv Detail & Related papers (2026-01-07T08:31:08Z)
Training Emergent Joint Associations: A Reinforcement Learning Approach to Creative Thinking in Language Models [9.943285575387849]
Associative thinking is a foundational element of human creativity and problem-solving. This paper explores whether reinforcement learning guided by associative thinking principles can enhance a model's performance across diverse generative tasks.
arXiv Detail & Related papers (2025-11-22T02:10:27Z)
CreativityPrism: A Holistic Benchmark for Large Language Model Creativity [64.18257552903151]
Creativity is often seen as a hallmark of human intelligence. There is still no holistic framework to evaluate their creativity across diverse scenarios. We propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity.
arXiv Detail & Related papers (2025-10-23T00:22:10Z)
DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay [50.31585196187091]
We introduce DixitWorld, a comprehensive evaluation suite designed to deconstruct multimodal abductive reasoning. DixitWorld features two core components: DixitArena, a dynamic, multi-agent environment that evaluates hypothesis generation and hypothesis selection. Results from DixitArena reveal distinct, role-dependent behaviors.
arXiv Detail & Related papers (2025-10-11T08:48:48Z)
Jointly Reinforcing Diversity and Quality in Language Model Generations [64.72289248044514]
Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity.
arXiv Detail & Related papers (2025-09-02T17:38:47Z)
Probing and Inducing Combinational Creativity in Vision-Language Models [52.76981145923602]
Recent advances in Vision-Language Models (VLMs) have sparked debate about whether their outputs reflect combinational creativity. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework.
arXiv Detail & Related papers (2025-04-17T17:38:18Z)
Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models [19.700493685081604]
Large language models (LLMs) are increasingly used for ideation and scientific discovery. Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality. We propose a new novelty metric for LLM generations that balances originality and quality.
arXiv Detail & Related papers (2025-04-13T00:48:58Z)
A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
The Oogiri game is a creativity-driven task requiring humor and associative thinking. LoTbench is an interactive, causality-aware evaluation framework. Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z)
Steering Large Language Models to Evaluate and Amplify Creativity [7.031631627161492]
We show that we can leverage this knowledge of how to write creatively in order to better judge what is creative. We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond "boringly" or "creatively".
arXiv Detail & Related papers (2024-12-08T20:28:48Z)
Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings. We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
Creativity Has Left the Chat: The Price of Debiasing Language Models [1.223779595809275]
We investigate the unintended consequences of Reinforcement Learning from Human Feedback on the creativity of Large Language Models (LLMs). Our findings have significant implications for marketers who rely on LLMs for creative tasks such as copywriting, ad creation, and customer persona generation.
arXiv Detail & Related papers (2024-06-08T22:14:51Z)
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
The Creative Frontier of Generative AI: Managing the Novelty-Usefulness Tradeoff [0.4873362301533825]
We explore the optimal balance between novelty and usefulness in generative Artificial Intelligence (AI) systems. Overemphasizing either aspect can lead to limitations such as hallucinations and memorization.
arXiv Detail & Related papers (2023-06-06T11:44:57Z)
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models. We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
Towards Creativity Characterization of Generative Models via Group-based Subset Scanning [64.6217849133164]
We propose group-based subset scanning to identify, quantify, and characterize creative processes. We find that creative samples generate larger subsets of anomalies than normal or non-creative samples across datasets.
arXiv Detail & Related papers (2022-03-01T15:07:14Z)
A Contrastive Framework for Neural Text Generation [46.845997620234265]
We show that an underlying reason for model degeneration is the anisotropic distribution of token representations. We present a contrastive solution: (i) SimCTG, a contrastive training objective to calibrate the model's representation space, and (ii) a decoding method -- contrastive search -- to encourage diversity while maintaining coherence in the generated text.
arXiv Detail & Related papers (2022-02-13T21:46:14Z)
Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step. We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
Language Model Evaluation in Open-ended Text Generation [0.76146285961466]
We study different evaluation metrics that have been proposed to evaluate quality, diversity and consistency of machine-generated text. From there, we propose a practical pipeline to evaluate language models in open-ended generation tasks.
arXiv Detail & Related papers (2021-08-08T06:16:02Z)
Towards creativity characterization of generative models via group-based subset scanning [51.84144826134919]
We propose group-based subset scanning to quantify, detect, and characterize creative processes. Creative samples generate larger subsets of anomalies than normal or non-creative samples across datasets.
arXiv Detail & Related papers (2021-04-01T14:07:49Z)
Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering [80.60605604261416]
We propose a novel neuro-symbolic framework for zero-shot question answering across commonsense tasks. We vary the set of language models, training regimes, knowledge sources, and data generation strategies, and measure their impact across tasks. We show that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
arXiv Detail & Related papers (2020-11-07T22:52:21Z)
Enhancing Dialogue Generation via Multi-Level Contrastive Learning [57.005432249952406]
We propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query. A Rank-aware Calibration (RC) network is designed to construct the multi-level contrastive optimization objectives. We build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words.
arXiv Detail & Related papers (2020-09-19T02:41:04Z)
Informed Sampling for Diversity in Concept-to-Text NLG [8.883733362171034]
We propose an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce. Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output.
arXiv Detail & Related papers (2020-04-29T17:43:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.