Auto-survey Challenge
- URL: http://arxiv.org/abs/2310.04480v2
- Date: Tue, 10 Oct 2023 09:31:04 GMT
- Title: Auto-survey Challenge
- Authors: Thanh Gia Hieu Khuong (TAU, LISN), Benedictus Kent Rachmat (TAU, LISN)
- Abstract summary: We present a novel platform for evaluating the capability of Large Language Models (LLMs) to autonomously compose and critique survey papers.
Within this framework, we organized a competition for the AutoML conference 2023.
Entrants are tasked with presenting stand-alone models adept at authoring articles from designated prompts and subsequently appraising them.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel platform for evaluating the capability of Large Language Models (LLMs) to autonomously compose and critique survey papers spanning a vast array of disciplines, including sciences, humanities, education, and law. Within this framework, AI systems undertake a simulated peer-review mechanism akin to traditional scholarly journals, with human organizers serving in an editorial oversight capacity. Using this platform, we organized a competition for the AutoML conference 2023. Entrants are tasked with presenting stand-alone models adept at authoring articles from designated prompts and subsequently appraising them. Assessment criteria include clarity, reference appropriateness, accountability, and the substantive value of the content. This paper presents the design of the competition, including the implementation, baseline submissions, and methods of evaluation.
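To make the submission protocol concrete, below is a minimal sketch of what a stand-alone entrant could look like: one method to author a survey from a designated prompt and one to appraise a paper on the four stated criteria. The class and method names, the 0-1 score scale, and the averaging are illustrative assumptions, not the competition's actual API.

```python
# Hypothetical sketch of a stand-alone entrant: the class name, method names,
# 0-1 score scale, and simple averaging are assumptions for illustration only,
# not the Auto-survey Challenge's actual submission API.
from dataclasses import dataclass


@dataclass
class Review:
    clarity: float                    # is the writing easy to follow?
    reference_appropriateness: float  # are the cited sources relevant and real?
    accountability: float             # are claims attributed and verifiable?
    substance: float                  # substantive value of the content

    def overall(self) -> float:
        """Aggregate the four criteria into a single score (plain average)."""
        return (self.clarity + self.reference_appropriateness
                + self.accountability + self.substance) / 4.0


class Entrant:
    def generate_paper(self, prompt: str) -> str:
        """Author a survey article from the designated prompt."""
        # A real submission would call an LLM here; this stub only echoes.
        return f"# Survey: {prompt}\n\n(Generated survey text would go here.)"

    def review_paper(self, prompt: str, paper: str) -> Review:
        """Appraise a candidate paper against the assessment criteria."""
        # Placeholder heuristics; a real reviewer model would analyse the text
        # and verify the references.
        has_refs = "References" in paper
        return Review(clarity=0.5,
                      reference_appropriateness=0.7 if has_refs else 0.2,
                      accountability=0.5,
                      substance=0.5)


if __name__ == "__main__":
    entrant = Entrant()
    topic = "Large Language Models as autonomous survey writers"
    paper = entrant.generate_paper(topic)
    print(entrant.review_paper(topic, paper).overall())
```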
Related papers
- Optimizing the role of human evaluation in LLM-based spoken document summarization systems [0.0]
We propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content.
We provide detailed evaluation criteria and best practices guidelines to ensure robustness in the experimental design, replicability, and trustworthiness of human evaluations.
arXiv Detail & Related papers (2024-10-23T18:37:14Z)
- Analysis of the ICML 2023 Ranking Data: Can Authors' Opinions of Their Own Papers Assist Peer Review in Machine Learning? [52.00419656272129]
We conducted an experiment during the 2023 International Conference on Machine Learning (ICML).
We received 1,342 rankings, each from a distinct author, pertaining to 2,592 submissions.
We focus on the Isotonic Mechanism, which calibrates raw review scores using author-provided rankings; a rough sketch of this calibration idea is given below.
arXiv Detail & Related papers (2024-08-24T01:51:23Z)
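As a hedged illustration of that calibration step (not the paper's exact estimator), one can project the raw scores onto the author's ranking with isotonic regression, moving them as little as possible in a least-squares sense; the scores, ranking, and scale below are invented.

```python
# Illustrative calibration in the spirit of the Isotonic Mechanism: nudge raw
# review scores so they never contradict the author's own ranking, changing
# them as little as possible (least squares). Scores and ranking are invented.
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_scores = np.array([5.2, 6.8, 4.9, 7.1])   # raw review scores per paper
author_ranking = np.array([1, 0, 3, 2])       # paper indices, author's best first

# Fit a non-increasing isotonic regression over the author's rank order.
iso = IsotonicRegression(increasing=False)
calibrated_in_rank_order = iso.fit_transform(
    np.arange(len(author_ranking)), raw_scores[author_ranking]
)

# Map the calibrated values back to the original paper order.
calibrated = np.empty_like(raw_scores)
calibrated[author_ranking] = calibrated_in_rank_order
print(calibrated)  # e.g. [6.15 6.8  4.9  6.15]
```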
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- RelevAI-Reviewer: A Benchmark on AI Reviewers for Survey Paper Relevance [0.8089605035945486]
We propose RelevAI-Reviewer, an automatic system that conceptualizes the task of survey paper review as a classification problem.
We introduce a novel dataset comprising 25,164 instances. Each instance contains one prompt and four candidate papers, each varying in relevance to the prompt.
We develop a machine learning (ML) model capable of determining the relevance of each paper and identifying the most pertinent one; a toy illustration of this prompt-versus-candidates setup is sketched below.
arXiv Detail & Related papers (2024-06-13T06:42:32Z)
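The sketch below is only a toy stand-in for the RelevAI-Reviewer setup: TF-IDF cosine similarity replaces the paper's actual relevance model, and the prompt and candidate texts are invented.

```python
# Toy illustration of the "one prompt, four candidate papers" setup: TF-IDF
# cosine similarity stands in for the paper's actual relevance model, and the
# prompt and candidate texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

prompt = "large language models for writing and reviewing survey papers"
candidates = [
    "Convolutional networks for medical image segmentation.",
    "Can large language models critique survey papers at scale?",
    "Benchmarking reinforcement learning agents in board games.",
    "A survey of automated peer review with large language models.",
]

vectorizer = TfidfVectorizer().fit([prompt] + candidates)
relevance = cosine_similarity(
    vectorizer.transform([prompt]), vectorizer.transform(candidates)
).ravel()

print(relevance, "-> most pertinent candidate index:", int(relevance.argmax()))
```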
- AI and Machine Learning for Next Generation Science Assessments [0.7416846035207727]
This chapter focuses on the transformative role of Artificial Intelligence (AI) and Machine Learning (ML) in science assessments.
The paper begins with a discussion of the Framework for K-12 Science Education, which calls for a shift from conceptual learning to knowledge-in-use.
The paper achieves three major goals: reviewing the current state of ML-based assessments in science education, introducing a framework for scoring accuracy in ML-based automatic assessments, and discussing future directions and challenges.
arXiv Detail & Related papers (2024-04-23T01:39:20Z)
- A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
arXiv Detail & Related papers (2024-02-20T11:28:50Z)
- Rethinking Word-Level Auto-Completion in Computer-Aided Translation [76.34184928621477]
Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted Translation.
It aims at providing word-level auto-completion suggestions for human translators.
We introduce a measurable criterion for what makes a good auto-completion suggestion and discover that existing WLAC models often fail to meet this criterion.
We propose an effective approach to enhance WLAC performance by promoting adherence to the criterion.
arXiv Detail & Related papers (2023-10-23T03:11:46Z)
- Evaluating the Generation Capabilities of Large Chinese Language Models [27.598864484231477]
This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework.
It assesses the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Gscore automates the quality measurement of a model's text generation against reference standards.
arXiv Detail & Related papers (2023-08-09T09:22:56Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
- Image Quality Assessment in the Modern Age [53.19271326110551]
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA).
We will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli.
Both hand-engineered and (deep) learning-based methods will be covered.
arXiv Detail & Related papers (2021-10-19T02:38:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.