AdTEC: A Unified Benchmark for Evaluating Text Quality in Search Engine Advertising
- URL: http://arxiv.org/abs/2408.05906v2
- Date: Sun, 27 Apr 2025 12:36:52 GMT
- Title: AdTEC: A Unified Benchmark for Evaluating Text Quality in Search Engine Advertising
- Authors: Peinan Zhang, Yusuke Sakai, Masato Mita, Hiroki Ouchi, Taro Watanabe
- Abstract summary: We propose AdTEC (Ad Text Evaluation Benchmark by CyberAgent), the first public benchmark to evaluate ad texts from multiple perspectives. Our contributions are as follows: (i) Defining five tasks for evaluating the quality of ad texts, as well as building a Japanese dataset based on the practical operational experiences of advertising agencies, which are typically kept in-house.
- Score: 19.642481233488667
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the increase in the fluency of ad texts automatically created by natural language generation technology, there is a high demand to verify the quality of these creatives in a real-world setting. We propose AdTEC (Ad Text Evaluation Benchmark by CyberAgent), the first public benchmark to evaluate ad texts from multiple perspectives within practical advertising operations. Our contributions are as follows: (i) Defining five tasks for evaluating the quality of ad texts, as well as building a Japanese dataset based on the practical operational experiences of advertising agencies, which are typically kept in-house. (ii) Validating the performance of existing pre-trained language models (PLMs) and human evaluators on the dataset. (iii) Analyzing the characteristics of the benchmark and identifying its remaining challenges. The results show that while PLMs have already reached a practical level of performance in several tasks, humans still outperform them in certain domains, implying that there is significant room for improvement in this area.
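To make the evaluation setting concrete, here is a minimal sketch of how a quality-classification task of this kind is typically run with a fine-tuned PLM via Hugging Face `transformers`. It is illustrative only, not the authors' pipeline: the checkpoint name, label scheme, and example ad texts are placeholder assumptions, not part of the AdTEC release.

```python
# Minimal sketch (not the AdTEC reference implementation): scoring ad-text
# acceptability with a fine-tuned Japanese PLM classifier.
# The checkpoint name below is a placeholder assumption, not a real release.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-org/japanese-plm-ad-acceptability",  # hypothetical fine-tuned checkpoint
)

ad_texts = [
    "年会費無料のクレジットカード、最短即日発行でポイント還元率1.5%。",
    "今すぐクリック!!! 絶対お得!!!",
]
for text in ad_texts:
    pred = classifier(text)[0]  # e.g. {"label": "acceptable", "score": 0.93}
    print(f"{pred['label']}\t{pred['score']:.3f}\t{text}")
```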
Related papers
- Visual Text Processing: A Comprehensive Review and Unified Evaluation [99.57846940547171]
We present a comprehensive, multi-perspective analysis of recent advancements in visual text processing.
Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing.
arXiv Detail & Related papers (2025-04-30T14:19:29Z)
- Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models [24.481028155002523]
We present Zero-shot Benchmarking (ZSB), a framework for creating high-quality benchmarks for any task.
ZSB is simple and flexible: it requires only the creation of a prompt for data generation and one for evaluation.
It is scalable to tasks and languages where collecting real-world data is costly or impractical.
arXiv Detail & Related papers (2025-04-01T17:40:08Z)
- Optimizing Multi-Stage Language Models for Effective Text Retrieval [0.0]
We introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets.
Our method leverages advanced language models to achieve state-of-the-art performance.
To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies.
arXiv Detail & Related papers (2024-12-26T16:05:19Z)
- Benchmarking pre-trained text embedding models in aligning built asset information [0.0]
This study presents a comparative benchmark of state-of-the-art text embedding models to evaluate their effectiveness in aligning built asset information with domain-specific technical concepts.
The results of our benchmarking across six proposed datasets, covering three tasks of clustering, retrieval, and reranking, highlight the need for future research on domain adaptation techniques.
arXiv Detail & Related papers (2024-11-18T20:54:17Z)
- What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks.
We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual.
We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z)
- Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark [0.0]
We provide an overview of the advances in universal text embedding models, with a focus on the top-performing text embeddings on the Massive Text Embedding Benchmark (MTEB).
Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.
arXiv Detail & Related papers (2024-05-27T09:52:54Z)
- AutoAD III: The Prequel -- Back to the Pixels [96.27059234129788]
We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these.
We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models.
We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance.
arXiv Detail & Related papers (2024-04-22T17:59:57Z)
- A Systematic Review of Data-to-Text NLG [2.4769539696439677]
Methods for producing high-quality text are explored, addressing the challenge of hallucinations in data-to-text generation.
Despite advancements in text quality, the review emphasizes the importance of research in low-resourced languages.
arXiv Detail & Related papers (2024-02-13T14:51:45Z)
- TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction [131.7684896032888]
We present TextEE, a standardized, fair, and reproducible benchmark for event extraction.
TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains.
We evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance.
arXiv Detail & Related papers (2023-11-16T04:43:03Z)
- Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation [5.3558730908641525]
We propose a first benchmark dataset, CAMERA, to standardize the task of ad text generation (ATG).
Our experiments show the current state and the remaining challenges.
We also explore how existing metrics in ATG and an LLM-based evaluator align with human evaluations.
arXiv Detail & Related papers (2023-09-21T12:51:24Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
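As a rough, hedged illustration of this decomposition idea (not the paper's implementation), the sketch below asks a small instruction-tuned PLM one yes/no subquestion per sentence and recomposes the answers into an overall score; the model choice, prompt wording, and aggregation rule are assumptions.

```python
# Loose sketch of decomposed, instruction-style evaluation (assumptions noted above).
from transformers import pipeline

qa = pipeline("text2text-generation", model="google/flan-t5-small")

generated = "The hotel was clean and quiet. The the staff was were rude."
sentences = [s.strip() for s in generated.split(".") if s.strip()]

answers = []
for sent in sentences:
    # One instruction-style subquestion per sentence.
    subquestion = (
        f'Is the following sentence fluent and grammatical? Answer yes or no: "{sent}"'
    )
    reply = qa(subquestion, max_new_tokens=5)[0]["generated_text"].strip().lower()
    answers.append(reply.startswith("yes"))

# Recompose the subquestion answers into a single evaluation result.
print(f"fluency score = {sum(answers) / len(answers):.2f}")
```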
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable method among the three exploited approaches.
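For intuition, a hedged sketch of this explicit-score style of prompting is given below; it simply asks a chat model for a single number. The model name, rating scale, and prompt wording are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of explicit-score prompting (model, scale, and prompt are assumptions).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def explicit_score(text: str) -> float:
    prompt = (
        "Rate the overall quality of the following text on a scale from 1 (worst) "
        "to 5 (best). Reply with the number only.\n\n" + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper used ChatGPT-era models
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())

print(explicit_score("The cat sat in on the mat mat."))
```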
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- A Unified Knowledge Graph Augmentation Service for Boosting Domain-specific NLP Tasks [10.28161912127425]
We propose KnowledgeDA, a unified domain language model development service to enhance the task-specific training procedure with domain knowledge graphs.
We implement a prototype of KnowledgeDA to learn language models for two domains, healthcare and software development.
arXiv Detail & Related papers (2022-12-10T09:18:43Z)
- EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: an instruction-based benchmark and evaluation suite for the automatic evaluation of editing capabilities.
We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
- K-AID: Enhancing Pre-trained Language Models with Domain Knowledge for Question Answering [8.772466918885224]
We propose K-AID, a systematic approach that includes a low-cost knowledge acquisition process for acquiring domain knowledge.
Instead of capturing entity knowledge like the majority of existing K-PLMs, our approach captures relational knowledge.
We conducted experiments on five text classification tasks and three text matching tasks from three domains, namely E-commerce, Government, and Film&TV, and performed online A/B tests in E-commerce.
arXiv Detail & Related papers (2021-09-22T07:19:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.