Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation
- URL: http://arxiv.org/abs/2502.13207v2
- Date: Mon, 04 Aug 2025 05:39:29 GMT
- Title: Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation
- Authors: Giorgio Franceschelli, Mirco Musolesi,
- Abstract summary: Large language models for creative tasks often lack diversity.<n>Common solutions, such as sampling at higher temperatures, can compromise the quality of the results.<n>We propose a context-based score to quantitatively evaluate value and originality.
- Score: 2.4555276449137042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the increasing use of large language models for creative tasks, their outputs often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. Dealing with this trade-off is still an open challenge in designing AI systems for creativity. Drawing on information theory, we propose a context-based score to quantitatively evaluate value and originality. This score incentivizes accuracy and adherence to the request while fostering divergence from the learned distribution. We show that our score can be used as a reward in a reinforcement learning framework to fine-tune large language models for maximum performance. We validate our strategy through experiments considering a variety of creative tasks, such as poetry generation and math problem solving, demonstrating that it enhances the value and originality of the generated solutions.
Related papers
- Probing and Inducing Combinational Creativity in Vision-Language Models [52.76981145923602]
Recent advances in Vision-Language Models (VLMs) have sparked debate about whether their outputs reflect combinational creativity.<n>We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels.<n>To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework.
arXiv Detail & Related papers (2025-04-17T17:38:18Z) - Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models [19.700493685081604]
Large language models (LLMs) are increasingly used for ideation and scientific discovery.
Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality.
We propose a new novelty metric for LLM generations that balances originality and quality.
arXiv Detail & Related papers (2025-04-13T00:48:58Z) - A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
Oogiri game is a creativity-driven task requiring humor and associative thinking.<n>LoTbench is an interactive, causality-aware evaluation framework.<n>Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z) - Steering Large Language Models to Evaluate and Amplify Creativity [7.031631627161492]
We show that we can leverage this knowledge of how to write creatively in order to better judge what is creative.<n>We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond "boringly" or "creatively"
arXiv Detail & Related papers (2024-12-08T20:28:48Z) - Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings.<n>We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z) - Creativity Has Left the Chat: The Price of Debiasing Language Models [1.223779595809275]
We investigate the unintended consequences of Reinforcement Learning from Human Feedback on the creativity of Large Language Models (LLMs)
Our findings have significant implications for marketers who rely on LLMs for creative tasks such as copywriting, ad creation, and customer persona generation.
arXiv Detail & Related papers (2024-06-08T22:14:51Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - The Creative Frontier of Generative AI: Managing the Novelty-Usefulness
Tradeoff [0.4873362301533825]
We explore the optimal balance between novelty and usefulness in generative Artificial Intelligence (AI) systems.
Overemphasizing either aspect can lead to limitations such as hallucinations and memorization.
arXiv Detail & Related papers (2023-06-06T11:44:57Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.<n>We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.<n>By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Towards Creativity Characterization of Generative Models via Group-based
Subset Scanning [64.6217849133164]
We propose group-based subset scanning to identify, quantify, and characterize creative processes.
We find that creative samples generate larger subsets of anomalies than normal or non-creative samples across datasets.
arXiv Detail & Related papers (2022-03-01T15:07:14Z) - A Contrastive Framework for Neural Text Generation [46.845997620234265]
We show that an underlying reason for model degeneration is the anisotropic distribution of token representations.
We present a contrastive solution: (i) SimCTG, a contrastive training objective to calibrate the model's representation space, and (ii) a decoding method -- contrastive search -- to encourage diversity while maintaining coherence in the generated text.
arXiv Detail & Related papers (2022-02-13T21:46:14Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criteria considered in this work are self-normalized and there is no need to further conduct a correction step.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Language Model Evaluation in Open-ended Text Generation [0.76146285961466]
We study different evaluation metrics that have been proposed to evaluate quality, diversity and consistency of machine-generated text.
From there, we propose a practical pipeline to evaluate language models in open-ended generation task.
arXiv Detail & Related papers (2021-08-08T06:16:02Z) - Towards creativity characterization of generative models via group-based
subset scanning [51.84144826134919]
We propose group-based subset scanning to quantify, detect, and characterize creative processes.
Creative samples generate larger subsets of anomalies than normal or non-creative samples across datasets.
arXiv Detail & Related papers (2021-04-01T14:07:49Z) - Knowledge-driven Data Construction for Zero-shot Evaluation in
Commonsense Question Answering [80.60605604261416]
We propose a novel neuro-symbolic framework for zero-shot question answering across commonsense tasks.
We vary the set of language models, training regimes, knowledge sources, and data generation strategies, and measure their impact across tasks.
We show that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
arXiv Detail & Related papers (2020-11-07T22:52:21Z) - Enhancing Dialogue Generation via Multi-Level Contrastive Learning [57.005432249952406]
We propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query.
A Rank-aware (RC) network is designed to construct the multi-level contrastive optimization objectives.
We build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words.
arXiv Detail & Related papers (2020-09-19T02:41:04Z) - Informed Sampling for Diversity in Concept-to-Text NLG [8.883733362171034]
We propose an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce.
Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output.
arXiv Detail & Related papers (2020-04-29T17:43:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.