Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question
Answering Benchmark
- URL: http://arxiv.org/abs/2402.19248v2
- Date: Sat, 2 Mar 2024 04:37:37 GMT
- Title: Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question
Answering Benchmark
- Authors: Zhikun Xu, Yinghui Li, Ruixue Ding, Xinyu Wang, Boli Chen, Yong Jiang,
Hai-Tao Zheng, Wenlian Lu, Pengjun Xie, Fei Huang
- Abstract summary: We introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet.
We obtain high-quality data through a pipeline that combines humans and models.
We have also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA.
- Score: 69.3415799675046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to better evaluate the capabilities of Large Language Models (LLMs) is
a focal point and hot topic in current LLM research. Previous work has noted
that due to the extremely high cost of iterative updates of LLMs, they are
often unable to answer the latest dynamic questions well. To promote the
improvement of Chinese LLMs' ability to answer dynamic questions, in this
paper, we introduce CDQA, a Chinese Dynamic QA benchmark containing
question-answer pairs related to the latest news on the Chinese Internet. We
obtain high-quality data through a pipeline that combines humans and models,
and carefully classify the samples according to the frequency of answer changes
to facilitate a more fine-grained observation of LLMs' capabilities. We have
also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA.
Extensive experiments and valuable insights suggest that our proposed CDQA is
challenging and worthy of further study. We believe that the benchmark we
provide will become one of the key data resources for improving LLMs' Chinese
question-answering ability in the future.
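To make the classification concrete, here is a minimal sketch (not the authors' released code) of how question-answer pairs could be bucketed by answer-change frequency and scored per bucket; the category labels, field names, and exact-match metric are illustrative assumptions rather than CDQA's actual schema.

```python
# Minimal sketch (not the CDQA authors' code): bucket QA pairs by how often
# their answers change, then score a model's predictions per bucket.
# Category labels and fields are illustrative assumptions, not CDQA's schema.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class QAPair:
    question: str
    answer: str
    change_frequency: str  # e.g. "fast-changing", "slow-changing", "never-changing"

def exact_match(prediction: str, reference: str) -> bool:
    # Crude normalized exact match; a real evaluation might use F1 or human judgment.
    return prediction.strip().lower() == reference.strip().lower()

def evaluate_by_frequency(samples: list[QAPair],
                          predict: Callable[[str], str]) -> dict[str, float]:
    """Per-category accuracy, where `predict` maps a question to the model's answer."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s.change_frequency] += 1
        correct[s.change_frequency] += exact_match(predict(s.question), s.answer)
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy per change-frequency bucket gives the finer-grained view of capability that the abstract describes, separating a model's static knowledge from its grasp of recent events.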
Related papers
- Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions [10.783827859678892]
We introduce Compound Question Synthesis (CQ-Syn) to create the Compound-QA benchmark.
This benchmark is derived from existing QA datasets, annotated with proprietary large language models.
It evaluates LLM capability along three dimensions: understanding, reasoning, and knowledge.
arXiv Detail & Related papers (2024-11-15T13:12:29Z)
- Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models [24.47838086336772]
Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions.
It focuses on the Chinese language, covering 6 major topics with 99 diverse subtopics.
arXiv Detail & Related papers (2024-11-11T17:10:56Z)
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to producing errors, hallucinations, and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
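As a rough illustration of the deliberative-planning idea summarized in the Q* entry above, here is a generic best-first-search skeleton over partial reasoning traces. It is a sketch under stated assumptions, not the Q* algorithm itself: the step generator, the scoring heuristic (standing in for a learned Q-value), and the completeness check are placeholders supplied by the caller.

```python
# Generic best-first search over partial reasoning traces -- a sketch of
# "deliberative planning guiding decoding", not the Q* paper's algorithm.
# `expand` proposes candidate next steps; `score` stands in for a learned
# Q-value heuristic estimating how promising a partial trace is.
import heapq
from typing import Callable

def plan_reasoning(question: str,
                   expand: Callable[[str, list[str]], list[str]],
                   score: Callable[[str, list[str]], float],
                   is_complete: Callable[[list[str]], bool],
                   max_expansions: int = 100) -> list[str]:
    """Best-first search for the most promising complete reasoning trace within a budget."""
    frontier = [(-score(question, []), [])]      # max-heap via negated scores
    best: list[str] = []
    for _ in range(max_expansions):
        if not frontier:
            break
        _, trace = heapq.heappop(frontier)       # most promising partial trace so far
        best = trace
        if is_complete(trace):
            return trace
        for step in expand(question, trace):     # candidate next reasoning steps
            new_trace = trace + [step]
            heapq.heappush(frontier, (-score(question, new_trace), new_trace))
    return best  # best-effort partial trace if no complete one was found in budget
```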
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model.
We employ a proxy model with far fewer parameters and take its answers as heuristic answers.
Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
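As a rough illustration of the proxy-routing idea summarized in the SlimPLM entry above (a sketch under stated assumptions, not the paper's implementation): a slim proxy model drafts a heuristic answer, and that draft decides whether the larger LLM should answer directly or with retrieved evidence. All callables here are hypothetical placeholders for the actual components.

```python
# Sketch of a SlimPLM-style flow (illustrative, not the paper's code): a slim
# proxy model drafts a heuristic answer; the draft decides whether the large
# LLM should answer directly or with retrieved evidence.
from typing import Callable, Optional

def answer_with_proxy_routing(
    question: str,
    proxy_generate: Callable[[str], str],             # slim proxy: question -> draft answer
    draft_is_reliable: Callable[[str, str], bool],    # judge: (question, draft) -> draft looks sound
    retrieve: Callable[[str], str],                   # retriever: query -> evidence passages
    llm_answer: Callable[[str, Optional[str]], str],  # large LLM: (question, context) -> answer
) -> str:
    draft = proxy_generate(question)                  # cheap heuristic answer from the proxy
    if draft_is_reliable(question, draft):
        return llm_answer(question, None)             # the LLM likely knows this; skip retrieval
    evidence = retrieve(question)                     # fetch the knowledge the draft suggests is missing
    return llm_answer(question, evidence)
```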
- Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models [29.202758753639078]
This study investigates the limitations of Multiple Choice Question Answering (MCQA) as an evaluation method for Large Language Models (LLMs).
We propose a dataset augmentation method for Multiple-Choice Questions (MCQs), MCQA+, that more accurately reflects model performance.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
- keqing: knowledge-based question answering is a nature chain-of-thought mentor of LLM [27.76205400533089]
Large language models (LLMs) have exhibited remarkable performance on various natural language processing (NLP) tasks, especially for question answering.
We present a novel framework that assists LLMs, such as ChatGPT, in retrieving question-related structured information from the knowledge graph.
The experimental results on KBQA datasets show that Keqing can achieve competitive performance and illustrate the logic of answering each question.
arXiv Detail & Related papers (2023-12-31T08:39:04Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
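As a rough sketch of search-augmented prompting in the spirit of the FreshLLMs entry above (the prompt layout, fields, and ordering heuristic are assumptions, not the paper's FreshPrompt template): retrieved snippets are sorted by date so the freshest evidence sits closest to the question.

```python
# Rough sketch of search-augmented prompting in the spirit of FreshPrompt
# (prompt layout, fields, and ordering heuristic are assumptions, not the
# paper's template): keep the most recent search snippets and place the
# freshest evidence closest to the question.
from dataclasses import dataclass

@dataclass
class Snippet:
    source: str
    date: str   # ISO date string, e.g. "2024-03-01", so lexical sort == chronological sort
    text: str

def build_fresh_prompt(question: str, snippets: list[Snippet], k: int = 5) -> str:
    recent = sorted(snippets, key=lambda s: s.date)[-k:]   # keep the k most recent, oldest first
    evidence = "\n".join(f"[{s.date}] {s.source}: {s.text}" for s in recent)
    return (
        "Answer the question using the most up-to-date evidence below. "
        "If the evidence is insufficient, say so instead of guessing.\n\n"
        f"{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )
```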
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's overall grasp and its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.