QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
- URL: http://arxiv.org/abs/2412.11763v1
- Date: Mon, 16 Dec 2024 13:28:29 GMT
- Title: QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
- Authors: Mohammad Aflah Khan, Neemesh Yadav, Sarah Masud, Md. Shad Akhtar,
- Abstract summary: QUENCH is a novel text-based English Quizzing Benchmark manually curated and transcribed from YouTube quiz videos.
At the intersection of geographical context and common sense reasoning, QUENCH helps assess world knowledge and deduction capabilities of LLMs.
- Score: 22.408857659304484
- Abstract: The rise of large language models (LLMs) has created a need for advanced benchmarking systems beyond traditional setups. To this end, we introduce QUENCH, a novel text-based English Quizzing Benchmark manually curated and transcribed from YouTube quiz videos. QUENCH possesses masked entities and rationales for the LLMs to predict via generation. At the intersection of geographical context and common sense reasoning, QUENCH helps assess world knowledge and deduction capabilities of LLMs via a zero-shot, open-domain quizzing setup. We perform an extensive evaluation on 7 LLMs and 4 metrics, investigating the influence of model size, prompting style, geographical context, and gold-labeled rationale generation. The benchmarking concludes with an analysis of the errors to which the LLMs are prone.
Related papers
- RuAG: Learned-rule-augmented Generation for Large Language Models [62.64389390179651]
We propose a novel framework, RuAG, to automatically distill large volumes of offline data into interpretable first-order logic rules.
We evaluate our framework on public and private industrial tasks, including natural language processing, time-series, decision-making, and industrial tasks.
arXiv Detail & Related papers (2024-11-04T00:01:34Z)
- Can LLMs Solve longer Math Word Problems Better? [47.227621867242]
Math Word Problems (MWPs) play a vital role in assessing the capabilities of Large Language Models (LLMs).
The impact of longer contexts on mathematical reasoning remains under-explored.
This study pioneers the investigation of Context Length Generalizability (CoLeG).
arXiv Detail & Related papers (2024-05-23T17:13:50Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models [59.84769254832941]
We propose a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp.
Specifically, the cunning texts that FLUB focuses on mainly consist of the tricky, humorous, and misleading texts collected from the real internet environment.
Based on FLUB, we investigate the performance of multiple representative and advanced LLMs.
arXiv Detail & Related papers (2024-02-16T22:12:53Z)
- Pushing The Limit of LLM Capacity for Text Classification [27.684335455517417]
We propose RGPT, an adaptive boosting framework tailored to produce a specialized text classification LLM.
We show that RGPT significantly outperforms 8 SOTA PLMs and 7 SOTA LLMs on four benchmarks by 1.36% on average.
arXiv Detail & Related papers (2024-02-12T08:14:03Z)
- Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of model parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
- ArcMMLU: A Library and Information Science Benchmark for Large Language Models [25.36473762494066]
This paper introduces ArcMMLU, a benchmark tailored for the Library & Information Science (LIS) domain in Chinese.
This benchmark aims to measure the knowledge and reasoning capability of LLMs within four key sub-domains: Archival Science, Data Science, Library Science, and Information Science.
Our comprehensive evaluation reveals that while most mainstream LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a notable performance gap.
arXiv Detail & Related papers (2023-11-30T16:08:04Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.