JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
- URL: http://arxiv.org/abs/2504.20849v1
- Date: Tue, 29 Apr 2025 15:19:06 GMT
- Title: JaccDiv: A Metric and Benchmark for Quantifying Diversity of Generated Marketing Text in the Music Industry
- Authors: Anum Afzal, Alexandre Mercier, Florian Matthes
- Abstract summary: This paper investigates data-to-text approaches to automatically generate marketing texts. We leverage Language Models such as T5, GPT-3.5, GPT-4, and LLaMa2 in conjunction with fine-tuning, few-shot, and zero-shot approaches. This research extends its relevance beyond the music industry, proving beneficial in various fields.
- Score: 47.76073133338117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online platforms are increasingly interested in using Data-to-Text technologies to generate content and help their users. Unfortunately, traditional generative methods often fall into repetitive patterns, resulting in monotonous galleries of texts after only a few iterations. In this paper, we investigate LLM-based data-to-text approaches to automatically generate marketing texts that are of sufficient quality and diverse enough for broad adoption. We leverage Language Models such as T5, GPT-3.5, GPT-4, and LLaMa2 in conjunction with fine-tuning, few-shot, and zero-shot approaches to set a baseline for diverse marketing texts. We also introduce JaccDiv, a metric for evaluating the diversity of a set of texts. This research extends its relevance beyond the music industry, proving beneficial in various fields where repetitive automated content generation is prevalent.
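The abstract does not spell out the JaccDiv formula; as a rough illustration of the idea, a minimal sketch of a Jaccard-based diversity score over a set of texts might look like the following (the word-bigram representation, the pairwise averaging, and all names such as jacc_div are assumptions, not the paper's definition):

```python
from itertools import combinations

def ngrams(text: str, n: int = 2) -> set:
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A intersect B| / |A union B|; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jacc_div(texts: list[str], n: int = 2) -> float:
    """Diversity as 1 minus the mean pairwise Jaccard similarity of n-gram sets.

    Higher values mean the texts share fewer n-grams, i.e. are more diverse.
    NOTE: hypothetical sketch; the paper's JaccDiv may differ in its details.
    """
    grams = [ngrams(t, n) for t in texts]
    pairs = list(combinations(grams, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim

# Two near-duplicate blurbs plus one distinct text: diversity well below 1.
texts = [
    "Stream the new single out now on all platforms",
    "Stream the new album out now on all platforms",
    "A raw, intimate live session recorded in one take",
]
print(round(jacc_div(texts), 3))
```

Under this reading, near-duplicate generations push the pairwise Jaccard similarity up and the diversity score toward zero, matching the paper's motivation of flagging monotonous galleries of texts.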
Related papers
- Robust and Fine-Grained Detection of AI Generated Texts [0.29569362468768806]
Existing systems often struggle to accurately identify AI-generated content in shorter texts. Our paper introduces a set of token-classification models trained on an extensive collection of human-machine co-authored texts. We also introduce a new dataset of over 2.4M such texts, mostly co-authored by several popular proprietary LLMs, across 23 languages.
arXiv Detail & Related papers (2025-04-16T10:29:30Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated text detection by proposing GigaCheck.
Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts.
Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize AI-generated intervals within text.
arXiv Detail & Related papers (2024-10-31T08:30:55Z) - Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark [0.0]
We provide an overview of the advances in universal text embedding models, with a focus on the top-performing text embeddings on the Massive Text Embedding Benchmark (MTEB).
Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.
arXiv Detail & Related papers (2024-05-27T09:52:54Z) - Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation [5.3558730908641525]
We propose a first benchmark dataset, CAMERA, to standardize the task of ad text generation (ATG).
Our experiments show the current state and the remaining challenges.
We also explore how existing metrics in ATG and an LLM-based evaluator align with human evaluations.
arXiv Detail & Related papers (2023-09-21T12:51:24Z) - A Benchmark for Text Expansion: Datasets, Metrics, and Baselines [87.47745669317894]
This work presents a new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations in plain text.
We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references.
On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which demonstrate superiority over Text2Text baselines.
arXiv Detail & Related papers (2023-09-17T07:54:38Z) - Teach LLMs to Personalize -- An Approach inspired by Writing Education [37.198598706659524]
We propose a general approach for personalized text generation using large language models (LLMs).
Inspired by the practice of writing education, we develop a multistage and multitask framework to teach LLMs for personalized generation.
arXiv Detail & Related papers (2023-08-15T18:06:23Z) - Towards Codable Watermarking for Injecting Multi-bits Information to LLMs [86.86436777626959]
Large language models (LLMs) generate texts with increasing fluency and realism.
Existing watermarking methods are encoding-inefficient and cannot flexibly meet the diverse information encoding needs.
We propose Codable Text Watermarking for LLMs (CTWL), which allows text watermarks to carry multi-bit customizable information.
arXiv Detail & Related papers (2023-07-29T14:11:15Z) - A Survey of Pretrained Language Models Based Text Generation [97.64625999380425]
Text Generation aims to produce plausible and readable text in human language from input data.
Deep learning has greatly advanced this field through neural generation models, especially the paradigm of pretrained language models (PLMs).
Grounding text generation on PLMs is seen as a promising direction in both academia and industry.
arXiv Detail & Related papers (2022-01-14T01:44:58Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - Text Data Augmentation: Towards better detection of spear-phishing emails [1.6556358263455926]
We propose a corpus and task augmentation framework to augment English texts within our company.
Our proposal combines different methods, utilizing the BERT language model, multi-step back-translation, and heuristics.
We show that our augmentation framework improves performances on several text classification tasks using publicly available models and corpora.
arXiv Detail & Related papers (2020-07-04T07:45:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.